From: Zi Yan <zi.yan(a)cs.rutgers.edu>
In [1], Andrea reported that during memory hotplug/hot-remove,
prep_transhuge_page() is incorrectly called on non-THP pages targeted for
migration when THP is enabled but THP migration is not.
This leaves the migration target pages in a bad state.
This patch fixes it by calling prep_transhuge_page() only when we are
certain that the target page is a THP.
[1] https://lkml.org/lkml/2017/11/20/411
Cc: stable(a)vger.kernel.org # v4.14
Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration")
Reported-by: Andrea Reale <ar(a)linux.vnet.ibm.com>
Signed-off-by: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
---
include/linux/migrate.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 895ec0c4942e..a2246cf670ba 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -54,7 +54,7 @@ static inline struct page *new_page_nodemask(struct page *page,
new_page = __alloc_pages_nodemask(gfp_mask, order,
preferred_nid, nodemask);
- if (new_page && PageTransHuge(page))
+ if (new_page && PageTransHuge(new_page))
prep_transhuge_page(new_page);
return new_page;
--
2.14.2
On Wed 22-11-17 07:29:38, Zi Yan wrote:
> On 22 Nov 2017, at 7:13, Zi Yan wrote:
>
> > On 22 Nov 2017, at 5:14, Michal Hocko wrote:
> >
> >> On Wed 22-11-17 10:35:10, Michal Hocko wrote:
> >> [...]
> >>> Moreover I am not really sure this is really working properly. Just look
> >>> at split_huge_page(): it moves all the tail pages to the LRU list,
> >>> while migrate_pages() has a list of pages to migrate. So we will migrate
> >>> the head page and all the rest will go back to the LRU list. What
> >>> guarantees that they will get migrated as well?
> >>
> >> OK, so this is as I've expected. It doesn't work! Some pfn-walker-based
> >> migration will just skip tail pages, see madvise_inject_error().
> >> __alloc_contig_migrate_range() will simply fail on a THP page, see
> >> isolate_migratepages_block(), so we do not even try to migrate it.
> >> do_move_page_to_node_array() will simply migrate the head and not care
> >> about tail pages. do_mbind() splits the page and then falls back to a pte
> >> walk when THP migration is not supported, but it doesn't handle tail
> >> pages if the THP migration path is not able to allocate a fresh THP,
> >> AFAICS. Memory hotplug should be safe because it doesn't skip the whole
> >> THP when doing its pfn walk.
> >>
> >> Unless I am missing something here this looks like a huge mess to me.
> >
> > +Kirill
> >
> > First, I agree with you that splitting a THP and only migrating its head page
> > is a mess. But what you describe is also the behavior of migrate_page()
> > _before_ THP migration support was added. I thought that was intended.
> >
> > Look at http://elixir.free-electrons.com/linux/v4.13.15/source/mm/migrate.c#L1091:
> > unmap_and_move() splits THPs and only migrates the head page in v4.13, before THP
> > migration was added. I think the behavior has been there since v4.5 (I just skimmed
> > the v4.0 to v4.13 code and did not have time to use git blame); before that, THPs were
> > not migrated but were reported as successfully migrated (at least in v4.4's code).
>
> Sorry, I misread v4.4's code; it also splits a THP and migrates only its head page.
> This behavior has been there for a long time, at least since v3.0.
>
> The code in unmap_and_move() is:
>
> 	if (unlikely(PageTransHuge(page)))
> 		if (unlikely(split_huge_page(page)))
> 			goto out;
I _think_ that this should all be handled at the migrate_pages() layer: try to
migrate the THP and, when that fails, fall back to splitting the huge page
into the migration list. I haven't checked whether there is something which
would prevent that, though. The THP tricks in specific paths should then be
removed.
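A sketch of that idea (hypothetical and untested, assuming
split_huge_page_to_list() is used to splice the tail pages onto the caller's
migration list):

	/* Inside the migrate_pages() retry loop (sketch): if a THP cannot
	 * be migrated as a whole, split it so that its tail pages land back
	 * on the 'from' list and are then migrated individually. */
	if (PageTransHuge(page) && !thp_migration_supported()) {
		lock_page(page);
		rc = split_huge_page_to_list(page, from);
		unlock_page(page);
		if (!rc)
			/* tail pages are now on 'from'; continue with them */
			list_safe_reset_next(page, page2, lru);
	}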
--
Michal Hocko
SUSE Labs
I made a mistake when converting the hugetlb code to 5-level paging:
in huge_pte_alloc() we have to use p4d_alloc(), not p4d_offset().
Otherwise it leads to a crash -- a NULL-pointer dereference in pud_alloc()
if the p4d table is not yet allocated.
This can only happen in 5-level paging mode. In 4-level paging mode
p4d_offset() always returns the pgd, so we are fine.
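For reference, with 4-level paging the p4d level is folded into the pgd by
the generic stub, roughly as in include/asm-generic/pgtable-nop4d.h:

	/* Sketch: with only four levels, p4d_offset() is a no-op cast and
	 * never needs an allocation, which is why 4-level mode is safe. */
	static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
	{
		return (p4d_t *)pgd;
	}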
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging")
Cc: <stable(a)vger.kernel.org> # v4.11+
---
mm/hugetlb.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2d2ff5e8bf2b..94a4c0b63580 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4617,7 +4617,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pte_t *pte = NULL;
pgd = pgd_offset(mm, addr);
- p4d = p4d_offset(pgd, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
pud = pud_alloc(mm, p4d, addr);
if (pud) {
if (sz == PUD_SIZE) {
--
2.15.0
Hi,
Is there a way to get notifications about patches queued up for the
-rc1 commits in linux-stable-rc.git? I currently use the push that is
announced by Greg's email to this list, but that only gives a 48-hour
window for building and testing. If there were a way to get the
patches earlier, it would give more time for testing activities. Any
hints are much appreciated.
Best Regards,
Milosz
On Wed 22-11-17 04:18:35, Zi Yan wrote:
> On 22 Nov 2017, at 3:54, Michal Hocko wrote:
[...]
> > I would keep the two checks consistent. But that leads to a more
> > interesting question. new_page_nodemask does
> >
> > 	if (thp_migration_supported() && PageTransHuge(page)) {
> > 		order = HPAGE_PMD_ORDER;
> > 		gfp_mask |= GFP_TRANSHUGE;
> > 	}
> >
> > How come it is safe to allocate an order-0 page if
> > !thp_migration_supported() when we are about to migrate THP? This
> > doesn't make any sense to me. Are we working around this somewhere else?
> > Why shouldn't we simply return NULL here?
>
> If !thp_migration_supported(), we will first split the THP and migrate
> its head page. This is done in unmap_and_move() after
> get_new_page() (the function pointer pointing to this new_page_nodemask())
> is called. So PageTransHuge(page) can be true here while the page is later
> split in unmap_and_move(), which is why we want to return an order-0 page
> here.
This deserves a big fat comment in the code because this is not clear
from the code!
> I think the confusion comes from the fact that there is no guarantee of a
> THP allocation when we are doing THP migration. If we can allocate a THP
> during THP migration, we are good. Otherwise, we want to fall back to
> the old way, splitting the original THP and migrating the head page,
> to preserve the original code behavior.
I understand that, but it should be done explicitly rather than relying
on two functions doing the right thing, because this is just too fragile.
Moreover I am not really sure this is really working properly. Just look
at split_huge_page(): it moves all the tail pages to the LRU list,
while migrate_pages() has a list of pages to migrate. So we will migrate
the head page and all the rest will go back to the LRU list. What
guarantees that they will get migrated as well?
This all looks like a mess!
--
Michal Hocko
SUSE Labs
On Tue 21 Nov 2017, 17:35, Zi Yan wrote:
> On 21 Nov 2017, at 17:12, Andrew Morton wrote:
>
> > On Mon, 20 Nov 2017 21:18:55 -0500 Zi Yan <zi.yan(a)sent.com> wrote:
> >
> >> This patch fixes it by only calling prep_transhuge_page() when we are
> >> certain that the target page is THP.
> >
> > What are the user-visible effects of the bug?
>
> By inspecting the code: if called on a non-THP page, prep_transhuge_page() will
> 1) change the mapping value of (page + 2), since that field is used for the THP deferred list;
> 2) change the lru value of (page + 1), since that field is used for the THP's dtor.
>
> Both can lead to data corruption of these two pages.
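For reference, a sketch of what prep_transhuge_page() does in this era
(reconstructed from memory of the v4.14 source; details may differ slightly):

	void prep_transhuge_page(struct page *page)
	{
		/* Uses storage in the second tail page (page + 2) as the
		 * deferred-split list_head, overlaying its mapping field. */
		INIT_LIST_HEAD(page_deferred_list(page));
		/* Stores the THP destructor in (page + 1)->compound_dtor,
		 * which overlays the first tail page's lru field. */
		set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
	}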
Pragmatically, and from the point of view of the memory_hotplug subsystem,
the effect is a kernel crash when pages are being migrated during a memory
hot-remove/offline operation and the migration target pages are found in a
bad state.
Best,
Andrea
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request, at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.9/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request, at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.4/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request, at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -447,8 +447,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.14/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request, at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.13/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request, at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-3.18/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
The patch below does not apply to the 4.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: [PATCH] mm, swap: fix false error message in __swp_swapcount()
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some of
the swap entries to read ahead by increasing the offset of the faulting
swap entry without checking whether they are beyond the end of the swap
device, and it relies on __swp_swapcount() and swapcache_prepare() to
check this. Although __swp_swapcount() checks the swap entry passed in,
it will complain with the following error message for the expected
invalid swap entry, which may confuse end users:
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in
swapin_readahead() to avoid passing out-of-bound swap entries, and the
swap entry reserved for the swap header, to __swp_swapcount() and
swapcache_prepare().
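To make the arithmetic concrete (an illustrative example with assumed
numbers, not taken from the report): with a readahead window of 8 pages,
mask = 7, so a fault anywhere in [0x0200f8a0, 0x0200f8a7] yields
end_offset = offset | mask = 0x0200f8a7. If the device's last valid offset
is smaller than that, the loop walks past the end of the device, which is
exactly the shape of the offset in the message above. The added clamp
(end_offset = si->max - 1) keeps the readahead loop inside the device.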
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 326439428daf..f2face8b889e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -559,6 +559,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true, page_allocated;
@@ -572,6 +573,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: [PATCH] mm, swap: fix false error message in __swp_swapcount()
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some of
the swap entries to read ahead by increasing the offset of the faulting
swap entry without checking whether they are beyond the end of the swap
device, and it relies on __swp_swapcount() and swapcache_prepare() to
check this. Although __swp_swapcount() checks the swap entry passed in,
it will complain with the following error message for the expected
invalid swap entry, which may confuse end users:
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in
swapin_readahead() to avoid passing out-of-bound swap entries, and the
swap entry reserved for the swap header, to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 326439428daf..f2face8b889e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -559,6 +559,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true, page_allocated;
@@ -572,6 +573,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
This is a note to let you know that I've just added the patch titled
mm, swap: fix false error message in __swp_swapcount()
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-swap-fix-false-error-message-in-__swp_swapcount.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: mm, swap: fix false error message in __swp_swapcount()
From: Huang Ying <huang.ying.caritas(a)gmail.com>
commit e9a6effa500526e2a19d5ad042cb758b55b1ef93 upstream.
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some of
the swap entries to read ahead by increasing the offset of the faulting
swap entry without checking whether they are beyond the end of the swap
device, and it relies on __swp_swapcount() and swapcache_prepare() to
check this. Although __swp_swapcount() checks the swap entry passed in,
it will complain with the following error message for the expected
invalid swap entry, which may confuse end users:
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in
swapin_readahead() to avoid passing out-of-bound swap entries, and the
swap entry reserved for the swap header, to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/swap_state.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -506,6 +506,7 @@ struct page *swapin_readahead(swp_entry_
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true;
@@ -519,6 +520,8 @@ struct page *swapin_readahead(swp_entry_
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
Patches currently in stable-queue which might be from huang.ying.caritas(a)gmail.com are
queue-4.13/mm-swap-fix-false-error-message-in-__swp_swapcount.patch
This is a note to let you know that I've just added the patch titled
x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-cpu-amd-derive-l3-shared_cpu_map-from-cpu_llc_shared_mask.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 2b83809a5e6d619a780876fcaf68cdc42b50d28c Mon Sep 17 00:00:00 2001
From: Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
Date: Mon, 31 Jul 2017 10:51:59 +0200
Subject: x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask
From: Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
commit 2b83809a5e6d619a780876fcaf68cdc42b50d28c upstream.
For systems with X86_FEATURE_TOPOEXT, the current logic uses the APIC ID
to calculate shared_cpu_map. However, APIC IDs are not guaranteed to
be contiguous for cores across different L3s (e.g. a family 17h system
with a downcore configuration). This breaks the logic and results in an
incorrect L3 shared_cpu_map.
Instead, always use the previously calculated cpu_llc_shared_mask of
each CPU to derive the L3 shared_cpu_map.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Link: http://lkml.kernel.org/r/20170731085159.9455-3-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kernel/cpu/intel_cacheinfo.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -811,7 +811,24 @@ static int __cache_amd_cpumap_setup(unsi
struct cacheinfo *this_leaf;
int i, sibling;
- if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+ /*
+ * For L3, always use the pre-calculated cpu_llc_shared_mask
+ * to derive shared_cpu_map.
+ */
+ if (index == 3) {
+ for_each_cpu(i, cpu_llc_shared_mask(cpu)) {
+ this_cpu_ci = get_cpu_cacheinfo(i);
+ if (!this_cpu_ci->info_list)
+ continue;
+ this_leaf = this_cpu_ci->info_list + index;
+ for_each_cpu(sibling, cpu_llc_shared_mask(cpu)) {
+ if (!cpu_online(sibling))
+ continue;
+ cpumask_set_cpu(sibling,
+ &this_leaf->shared_cpu_map);
+ }
+ }
+ } else if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
unsigned int apicid, nshared, first, last;
this_leaf = this_cpu_ci->info_list + index;
@@ -837,19 +854,6 @@ static int __cache_amd_cpumap_setup(unsi
continue;
cpumask_set_cpu(sibling,
&this_leaf->shared_cpu_map);
- }
- }
- } else if (index == 3) {
- for_each_cpu(i, cpu_llc_shared_mask(cpu)) {
- this_cpu_ci = get_cpu_cacheinfo(i);
- if (!this_cpu_ci->info_list)
- continue;
- this_leaf = this_cpu_ci->info_list + index;
- for_each_cpu(sibling, cpu_llc_shared_mask(cpu)) {
- if (!cpu_online(sibling))
- continue;
- cpumask_set_cpu(sibling,
- &this_leaf->shared_cpu_map);
}
}
} else
Patches currently in stable-queue which might be from suravee.suthikulpanit(a)amd.com are
queue-4.13/x86-cpu-amd-derive-l3-shared_cpu_map-from-cpu_llc_shared_mask.patch
Upstream commit ID: 2b83809a5e6d619a780876fcaf68cdc42b50d28c
Stable kernel version to apply: 4.13.x
Reason: This patch fixes the L3 topology for AMD Zen-based
(family 17h) processors with a down-core configuration.
Thanks,
Suravee
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
We should wait for dio requests to finish before taking the inode lock in
ocfs2_setattr(); otherwise the following deadlock will happen:

process 1                  process 2                    process 3
truncate file 'A'          end_io of writing file 'A'   receiving the bast messages
ocfs2_setattr
 ocfs2_inode_lock_tracker
  ocfs2_inode_lock_full
 inode_dio_wait
  __inode_dio_wait
  -->waiting for all dio
     requests to finish
                                                        dlm_proxy_ast_handler
                                                         dlm_do_local_bast
                                                          ocfs2_blocking_ast
                                                           ocfs2_generic_handle_bast
                                                            set OCFS2_LOCK_BLOCKED flag
                           dio_end_io
                            dio_bio_end_aio
                             dio_complete
                              ocfs2_dio_end_io
                               ocfs2_dio_end_io_write
                                ocfs2_inode_lock
                                 __ocfs2_cluster_lock
                                  ocfs2_wait_for_mask
                                  -->waiting for the OCFS2_LOCK_BLOCKED
                                     flag to be cleared, i.e. waiting for
                                     'process 1' to unlock the inode lock
                               inode_dio_end
                               -->this would decrement i_dio_count, but it
                                  will never be called, so we deadlock
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1166,6 +1166,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1186,8 +1193,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.9/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
ocfs2: fix cluster hang after a node dies
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-fix-cluster-hang-after-a-node-dies.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 1c01967116a678fed8e2c68a6ab82abc8effeddc Mon Sep 17 00:00:00 2001
From: Changwei Ge <ge.changwei(a)h3c.com>
Date: Wed, 15 Nov 2017 17:31:33 -0800
Subject: ocfs2: fix cluster hang after a node dies
From: Changwei Ge <ge.changwei(a)h3c.com>
commit 1c01967116a678fed8e2c68a6ab82abc8effeddc upstream.
When a node dies, the other live nodes have to choose a new master for any
existing lock resource mastered by the dead node.
In the ocfs2/dlm implementation, this is done by
dlm_move_lockres_to_recovery_list(), which marks those lock resources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which the DLM
changes the lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list(), no master will be
chosen after DLM recovery completes, since no lock resource can be found
through the ::resource list.
What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for lock
resources mastered by a dead node, it will break synchronization among
nodes.
So invoke dlm_move_lockres_to_recovery_list() again.
Fixes: ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-…
Signed-off-by: Changwei Ge <ge.changwei(a)h3c.com>
Reported-by: Vitaly Mayatskih <v.mayatskih(a)gmail.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Joseph Qi <jiangqi903(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/dlm/dlmrecovery.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanu
dlm_lockres_put(res);
continue;
}
+ dlm_move_lockres_to_recovery_list(dlm, res);
} else if (res->owner == dlm->node_num) {
dlm_free_dead_locks(dlm, res, dead_node);
__dlm_lockres_calc_usage(dlm, res);
Patches currently in stable-queue which might be from ge.changwei(a)h3c.com are
queue-4.9/ocfs2-fix-cluster-hang-after-a-node-dies.patch
queue-4.9/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to an underflow in
check_msg_timeout():
...
	ent->timeout -= timeout_period;
	if (ent->timeout > 0)
		return;
...
The type of timeout_period is long, but ent->timeout is unsigned long, so
the subtraction wraps around to a huge positive value and the timeout check
never triggers. This patch makes the types consistent.
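A minimal userspace sketch of the wraparound (hypothetical values, not
driver code):

	#include <stdio.h>

	int main(void)
	{
		unsigned long timeout = 100;	/* plays the role of ent->timeout */
		long period = 500;		/* plays the role of timeout_period */

		timeout -= period;	/* wraps: 100 - 500 becomes a huge value */
		if (timeout > 0)	/* true for any non-zero unsigned value */
			printf("timeout not expired: %lu\n", timeout);
		return 0;
	}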
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4029,7 +4029,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4042,8 +4043,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4109,7 +4110,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.9/ipmi-fix-unsigned-long-underflow.patch
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine the number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for the reserved memory in this node.
The reserved memory is determined by memblock_reserved_memory_within(),
which operates on physical addresses and returns a size in bytes. However,
reset_deferred_meminit() assumes that this function operates on pfns and
returns a page count.
The result is that, in the best case, the machine boots more slowly than
expected due to initializing more pages than needed in a single thread, and
in the worst case it panics because fewer pages than needed are initialized
early.
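The unit mismatch, distilled (a sketch rather than the literal hunk below):

	/* memblock_reserved_memory_within() takes physical addresses and
	 * returns a size in bytes, not a page count. */
	reserved = memblock_reserved_memory_within(start_addr, end_addr);
	max_pgcnt += reserved;			/* broken: adds bytes to a page count */
	max_pgcnt += PHYS_PFN(reserved);	/* fixed: convert bytes to pages first */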
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -672,7 +672,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -284,28 +284,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized during early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it to ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -332,7 +341,7 @@ static inline bool update_defer_init(pg_
if (zone_end < pgdat_end_pfn(pgdat))
return true;
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.9/mm-page_alloc.c-broken-deferred-calculation.patch
This is a note to let you know that I've just added the patch titled
dmaengine: dmatest: warn user when dma test times out
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dmaengine-dmatest-warn-user-when-dma-test-times-out.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From a9df21e34b422f79d9a9fa5c3eff8c2a53491be6 Mon Sep 17 00:00:00 2001
From: Adam Wallis <awallis(a)codeaurora.org>
Date: Thu, 2 Nov 2017 08:53:30 -0400
Subject: dmaengine: dmatest: warn user when dma test times out
From: Adam Wallis <awallis(a)codeaurora.org>
commit a9df21e34b422f79d9a9fa5c3eff8c2a53491be6 upstream.
Commit adfa543e7314 ("dmatest: don't use set_freezable_with_signal()")
introduced a bug (one that is in fact documented in that patch's commit
text) that leaves behind a dangling pointer. Since the done_wait structure
is allocated on the stack, future invocations of dmatest can produce
undesirable results (e.g., corrupted spinlocks). Ideally, this would be
cleaned up in the thread handler, but at the very least the kernel is left
in a very precarious state that can lead to some long debug sessions when
the crash comes later.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=197605
Signed-off-by: Adam Wallis <awallis(a)codeaurora.org>
Signed-off-by: Vinod Koul <vinod.koul(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/dma/dmatest.c | 1 +
1 file changed, 1 insertion(+)
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -666,6 +666,7 @@ static int dmatest_func(void *data)
* free it this time?" dancing. For now, just
* leave it dangling.
*/
+ WARN(1, "dmatest: Kernel stack may be corrupted!!\n");
dmaengine_unmap_put(um);
result("test timed out", total_tests, src_off, dst_off,
len, 0);
Patches currently in stable-queue which might be from awallis(a)codeaurora.org are
queue-4.9/dmaengine-dmatest-warn-user-when-dma-test-times-out.patch
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
We should wait for dio requests to finish before taking the inode lock in
ocfs2_setattr(); otherwise the following deadlock will happen:

process 1                  process 2                    process 3
truncate file 'A'          end_io of writing file 'A'   receiving the bast messages
ocfs2_setattr
 ocfs2_inode_lock_tracker
  ocfs2_inode_lock_full
 inode_dio_wait
  __inode_dio_wait
  -->waiting for all dio
     requests to finish
                                                        dlm_proxy_ast_handler
                                                         dlm_do_local_bast
                                                          ocfs2_blocking_ast
                                                           ocfs2_generic_handle_bast
                                                            set OCFS2_LOCK_BLOCKED flag
                           dio_end_io
                            dio_bio_end_aio
                             dio_complete
                              ocfs2_dio_end_io
                               ocfs2_dio_end_io_write
                                ocfs2_inode_lock
                                 __ocfs2_cluster_lock
                                  ocfs2_wait_for_mask
                                  -->waiting for the OCFS2_LOCK_BLOCKED
                                     flag to be cleared, i.e. waiting for
                                     'process 1' to unlock the inode lock
                               inode_dio_end
                               -->this would decrement i_dio_count, but it
                                  will never be called, so we deadlock
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1166,6 +1166,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1186,8 +1193,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.4/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This patch is a fix specific to the 3.19 - 4.4 kernels. The 4.5 kernel
inadvertently fixed this bug differently (db3cbfff5bcc0), but that commit
is not a stable candidate because it is a complicated rewrite of the
entire feature.
This patch fixes a potential timing bug with nvme's asynchronous queue
deletion, which causes an allocated request to be accidentally released
due to the ordering of the shared completion context among the sq/cq
pair. The completion context saves the request that issued the queue
deletion. If the submission side deletion happens to reset the active
request, the completion side will release the wrong request tag back into
the pool of available tags. This means the driver will create multiple
commands with the same tag, corrupting the queue context.
The error is observable in the kernel log as:
"nvme XX:YY:ZZ completed id XX twice on qid:0"
In this particular case, the message occurs because the queue is
corrupted.
The following timing sequence demonstrates the error:
CPU A                               CPU B
-----------------------             -----------------------------
nvme_irq
 nvme_process_cq
  async_completion
   queue_kthread_work  ----------->  nvme_del_sq_work_handler
                                      nvme_delete_cq
                                       adapter_async_del_queue
                                        nvme_submit_admin_async_cmd
                                         cmdinfo->req = req;
  blk_mq_free_request(cmdinfo->req);  <-- wrong request!!!
This patch fixes the bug by releasing the request on the completion side
prior to waking the submission thread, so that the submission thread cannot
muck with the shared completion context.
Fixes: a4aea5623d4a5 ("NVMe: Convert to blk-mq")
Cc: <stable(a)vger.kernel.org> # 4.4.x
Cc: <stable(a)vger.kernel.org> # 4.1.x
Signed-off-by: Keith Busch <keith.busch(a)intel.com>
---
drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 669edbd47602..d6ceb8b91cd6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -350,8 +350,8 @@ static void async_completion(struct nvme_queue *nvmeq, void *ctx,
struct async_cmd_info *cmdinfo = ctx;
cmdinfo->result = le32_to_cpup(&cqe->result);
cmdinfo->status = le16_to_cpup(&cqe->status) >> 1;
- queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
blk_mq_free_request(cmdinfo->req);
+ queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
}
static inline struct nvme_cmd_info *get_cmd_from_tag(struct nvme_queue *nvmeq,
--
2.13.6
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine the number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for the reserved memory in this node.
The reserved memory is determined by memblock_reserved_memory_within(),
which operates on physical addresses and returns a size in bytes. However,
reset_deferred_meminit() assumes that this function operates on pfns and
returns a page count.
The result is that, in the best case, the machine boots more slowly than
expected due to initializing more pages than needed in a single thread, and
in the worst case it panics because fewer pages than needed are initialized
early.
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -688,7 +688,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
} pg_data_t;
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -267,28 +267,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized during early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it to ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -324,7 +333,7 @@ static inline bool update_defer_init(pg_
return true;
/* Initialise at least 2G of the highest zone */
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.4/mm-page_alloc.c-broken-deferred-calculation.patch