Upstream commit ID: 2b83809a5e6d619a780876fcaf68cdc42b50d28c
Stable kernel version to apply: 4.13.x
Reason: This patch fixes the L3 topology for the AMD Zen-based
(family17h) processors with down-core configuration.
Thanks,
Suravee
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
we should wait dio requests to finish before inode lock in
ocfs2_setattr(), otherwise the following deadlock will happen:
process 1 process 2 process 3
truncate file 'A' end_io of writing file 'A' receiving the bast messages
ocfs2_setattr
ocfs2_inode_lock_tracker
ocfs2_inode_lock_full
inode_dio_wait
__inode_dio_wait
-->waiting for all dio
requests finish
dlm_proxy_ast_handler
dlm_do_local_bast
ocfs2_blocking_ast
ocfs2_generic_handle_bast
set OCFS2_LOCK_BLOCKED flag
dio_end_io
dio_bio_end_aio
dio_complete
ocfs2_dio_end_io
ocfs2_dio_end_io_write
ocfs2_inode_lock
__ocfs2_cluster_lock
ocfs2_wait_for_mask
-->waiting for OCFS2_LOCK_BLOCKED
flag to be cleared, that is waiting
for 'process 1' unlocking the inode lock
inode_dio_end
-->here dec the i_dio_count, but will never
be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1166,6 +1166,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1186,8 +1193,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.9/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
ocfs2: fix cluster hang after a node dies
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-fix-cluster-hang-after-a-node-dies.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1c01967116a678fed8e2c68a6ab82abc8effeddc Mon Sep 17 00:00:00 2001
From: Changwei Ge <ge.changwei(a)h3c.com>
Date: Wed, 15 Nov 2017 17:31:33 -0800
Subject: ocfs2: fix cluster hang after a node dies
From: Changwei Ge <ge.changwei(a)h3c.com>
commit 1c01967116a678fed8e2c68a6ab82abc8effeddc upstream.
When a node dies, other live nodes have to choose a new master for an
existed lock resource mastered by the dead node.
As for ocfs2/dlm implementation, this is done by function -
dlm_move_lockres_to_recovery_list which marks those lock rsources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
changes lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list, no master will be
choosed after dlm recovery accomplishment since no lock resource can be
found through ::resource list.
What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for lock
resources mastered a dead node, it will break up synchronization among
nodes.
So invoke dlm_move_lockres_to_recovery_list again.
Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")'
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-…
Signed-off-by: Changwei Ge <ge.changwei(a)h3c.com>
Reported-by: Vitaly Mayatskih <v.mayatskih(a)gmail.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Joseph Qi <jiangqi903(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/dlm/dlmrecovery.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanu
dlm_lockres_put(res);
continue;
}
+ dlm_move_lockres_to_recovery_list(dlm, res);
} else if (res->owner == dlm->node_num) {
dlm_free_dead_locks(dlm, res, dead_node);
__dlm_lockres_calc_usage(dlm, res);
Patches currently in stable-queue which might be from ge.changwei(a)h3c.com are
queue-4.9/ocfs2-fix-cluster-hang-after-a-node-dies.patch
queue-4.9/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to the overflow in function
check_msg_timeout:
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long.
This patch makes the type consistent.
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4029,7 +4029,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4042,8 +4043,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4109,7 +4110,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.9/ipmi-fix-unsigned-long-underflow.patch
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined in this function:
memblock_reserved_memory_within(), which operates over physical
addresses, and returns size in bytes. However, reset_deferred_meminit()
assumes that that this function operates with pfns, and returns page
count.
The result is that in the best case machine boots slower than expected
due to initializing more pages than needed in single thread, and in the
worst case panics because fewer than needed pages are initialized early.
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -672,7 +672,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -284,28 +284,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized durig early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -332,7 +341,7 @@ static inline bool update_defer_init(pg_
if (zone_end < pgdat_end_pfn(pgdat))
return true;
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.9/mm-page_alloc.c-broken-deferred-calculation.patch
This is a note to let you know that I've just added the patch titled
dmaengine: dmatest: warn user when dma test times out
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
dmaengine-dmatest-warn-user-when-dma-test-times-out.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From a9df21e34b422f79d9a9fa5c3eff8c2a53491be6 Mon Sep 17 00:00:00 2001
From: Adam Wallis <awallis(a)codeaurora.org>
Date: Thu, 2 Nov 2017 08:53:30 -0400
Subject: dmaengine: dmatest: warn user when dma test times out
From: Adam Wallis <awallis(a)codeaurora.org>
commit a9df21e34b422f79d9a9fa5c3eff8c2a53491be6 upstream.
Commit adfa543e7314 ("dmatest: don't use set_freezable_with_signal()")
introduced a bug (that is in fact documented by the patch commit text)
that leaves behind a dangling pointer. Since the done_wait structure is
allocated on the stack, future invocations to the DMATEST can produce
undesirable results (e.g., corrupted spinlocks). Ideally, this would be
cleaned up in the thread handler, but at the very least, the kernel
is left in a very precarious scenario that can lead to some long debug
sessions when the crash comes later.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=197605
Signed-off-by: Adam Wallis <awallis(a)codeaurora.org>
Signed-off-by: Vinod Koul <vinod.koul(a)intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/dma/dmatest.c | 1 +
1 file changed, 1 insertion(+)
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -666,6 +666,7 @@ static int dmatest_func(void *data)
* free it this time?" dancing. For now, just
* leave it dangling.
*/
+ WARN(1, "dmatest: Kernel stack may be corrupted!!\n");
dmaengine_unmap_put(um);
result("test timed out", total_tests, src_off, dst_off,
len, 0);
Patches currently in stable-queue which might be from awallis(a)codeaurora.org are
queue-4.9/dmaengine-dmatest-warn-user-when-dma-test-times-out.patch
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
we should wait dio requests to finish before inode lock in
ocfs2_setattr(), otherwise the following deadlock will happen:
process 1 process 2 process 3
truncate file 'A' end_io of writing file 'A' receiving the bast messages
ocfs2_setattr
ocfs2_inode_lock_tracker
ocfs2_inode_lock_full
inode_dio_wait
__inode_dio_wait
-->waiting for all dio
requests finish
dlm_proxy_ast_handler
dlm_do_local_bast
ocfs2_blocking_ast
ocfs2_generic_handle_bast
set OCFS2_LOCK_BLOCKED flag
dio_end_io
dio_bio_end_aio
dio_complete
ocfs2_dio_end_io
ocfs2_dio_end_io_write
ocfs2_inode_lock
__ocfs2_cluster_lock
ocfs2_wait_for_mask
-->waiting for OCFS2_LOCK_BLOCKED
flag to be cleared, that is waiting
for 'process 1' unlocking the inode lock
inode_dio_end
-->here dec the i_dio_count, but will never
be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1166,6 +1166,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1186,8 +1193,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.4/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This patch is a fix specific to the 3.19 - 4.4 kernels. The 4.5 kernel
inadvertently fixed this bug differently (db3cbfff5bcc0), but is not
a stable candidate due it being a complicated re-write of the entire
feature.
This patch fixes a potential timing bug with nvme's asynchronous queue
deletion, which causes an allocated request to be accidentally released
due to the ordering of the shared completion context among the sq/cq
pair. The completion context saves the request that issued the queue
deletion. If the submission side deletion happens to reset the active
request, the completion side will release the wrong request tag back into
the pool of available tags. This means the driver will create multiple
commands with the same tag, corrupting the queue context.
The error is observable in the kernel logs like:
"nvme XX:YY:ZZ completed id XX twice on qid:0"
In this particular case, this message occurs because the queue is
corrupted.
The following timing sequence demonstrates the error:
CPU A CPU B
----------------------- -----------------------------
nvme_irq
nvme_process_cq
async_completion
queue_kthread_work -----------> nvme_del_sq_work_handler
nvme_delete_cq
adapter_async_del_queue
nvme_submit_admin_async_cmd
cmdinfo->req = req;
blk_mq_free_request(cmdinfo->req); <-- wrong request!!!
This patch fixes the bug by releasing the request in the completion side
prior to waking the submission thread, such that that thread can't muck
with the shared completion context.
Fixes: a4aea5623d4a5 ("NVMe: Convert to blk-mq")
Cc: <stable(a)vger.kernel.org> # 4.4.x
Cc: <stable(a)vger.kernel.org> # 4.1.x
Signed-off-by: Keith Busch <keith.busch(a)intel.com>
---
drivers/nvme/host/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 669edbd47602..d6ceb8b91cd6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -350,8 +350,8 @@ static void async_completion(struct nvme_queue *nvmeq, void *ctx,
struct async_cmd_info *cmdinfo = ctx;
cmdinfo->result = le32_to_cpup(&cqe->result);
cmdinfo->status = le16_to_cpup(&cqe->status) >> 1;
- queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
blk_mq_free_request(cmdinfo->req);
+ queue_kthread_work(cmdinfo->worker, &cmdinfo->work);
}
static inline struct nvme_cmd_info *get_cmd_from_tag(struct nvme_queue *nvmeq,
--
2.13.6
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined in this function:
memblock_reserved_memory_within(), which operates over physical
addresses, and returns size in bytes. However, reset_deferred_meminit()
assumes that that this function operates with pfns, and returns page
count.
The result is that in the best case machine boots slower than expected
due to initializing more pages than needed in single thread, and in the
worst case panics because fewer than needed pages are initialized early.
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -688,7 +688,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
} pg_data_t;
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -267,28 +267,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized durig early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -324,7 +333,7 @@ static inline bool update_defer_init(pg_
return true;
/* Initialise at least 2G of the highest zone */
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.4/mm-page_alloc.c-broken-deferred-calculation.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to the overflow in function
check_msg_timeout:
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long.
This patch makes the type consistent.
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4029,7 +4029,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4042,8 +4043,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4109,7 +4110,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.4/ipmi-fix-unsigned-long-underflow.patch
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
we should wait dio requests to finish before inode lock in
ocfs2_setattr(), otherwise the following deadlock will happen:
process 1 process 2 process 3
truncate file 'A' end_io of writing file 'A' receiving the bast messages
ocfs2_setattr
ocfs2_inode_lock_tracker
ocfs2_inode_lock_full
inode_dio_wait
__inode_dio_wait
-->waiting for all dio
requests finish
dlm_proxy_ast_handler
dlm_do_local_bast
ocfs2_blocking_ast
ocfs2_generic_handle_bast
set OCFS2_LOCK_BLOCKED flag
dio_end_io
dio_bio_end_aio
dio_complete
ocfs2_dio_end_io
ocfs2_dio_end_io_write
ocfs2_inode_lock
__ocfs2_cluster_lock
ocfs2_wait_for_mask
-->waiting for OCFS2_LOCK_BLOCKED
flag to be cleared, that is waiting
for 'process 1' unlocking the inode lock
inode_dio_end
-->here dec the i_dio_count, but will never
be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1161,6 +1161,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1200,8 +1207,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.14/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
mm, swap: fix false error message in __swp_swapcount()
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-swap-fix-false-error-message-in-__swp_swapcount.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: mm, swap: fix false error message in __swp_swapcount()
From: Huang Ying <huang.ying.caritas(a)gmail.com>
commit e9a6effa500526e2a19d5ad042cb758b55b1ef93 upstream.
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA base swap readahead) may readahead several swap entries
after the fault swap entry. The readahead algorithm calculates some of
the swap entries to readahead via increasing the offset of the fault
swap entry without checking whether they are beyond the end of the swap
device and it relys on the __swp_swapcount() and swapcache_prepare() to
check it. Although __swp_swapcount() checks for the swap entry passed
in, it will complain with the error message as follow for the expected
invalid swap entry. This may make the end users confused.
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, the swap entry checking is added in
swapin_readahead() to avoid to pass the out-of-bound swap entries and
the swap entry reserved for the swap header to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/swap_state.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -559,6 +559,7 @@ struct page *swapin_readahead(swp_entry_
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true, page_allocated;
@@ -572,6 +573,8 @@ struct page *swapin_readahead(swp_entry_
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
Patches currently in stable-queue which might be from huang.ying.caritas(a)gmail.com are
queue-4.14/mm-swap-fix-false-error-message-in-__swp_swapcount.patch
This is a note to let you know that I've just added the patch titled
ocfs2: fix cluster hang after a node dies
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-fix-cluster-hang-after-a-node-dies.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1c01967116a678fed8e2c68a6ab82abc8effeddc Mon Sep 17 00:00:00 2001
From: Changwei Ge <ge.changwei(a)h3c.com>
Date: Wed, 15 Nov 2017 17:31:33 -0800
Subject: ocfs2: fix cluster hang after a node dies
From: Changwei Ge <ge.changwei(a)h3c.com>
commit 1c01967116a678fed8e2c68a6ab82abc8effeddc upstream.
When a node dies, other live nodes have to choose a new master for an
existed lock resource mastered by the dead node.
As for ocfs2/dlm implementation, this is done by function -
dlm_move_lockres_to_recovery_list which marks those lock rsources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
changes lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list, no master will be
choosed after dlm recovery accomplishment since no lock resource can be
found through ::resource list.
What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for lock
resources mastered a dead node, it will break up synchronization among
nodes.
So invoke dlm_move_lockres_to_recovery_list again.
Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")'
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-…
Signed-off-by: Changwei Ge <ge.changwei(a)h3c.com>
Reported-by: Vitaly Mayatskih <v.mayatskih(a)gmail.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Joseph Qi <jiangqi903(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/dlm/dlmrecovery.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanu
dlm_lockres_put(res);
continue;
}
+ dlm_move_lockres_to_recovery_list(dlm, res);
} else if (res->owner == dlm->node_num) {
dlm_free_dead_locks(dlm, res, dead_node);
__dlm_lockres_calc_usage(dlm, res);
Patches currently in stable-queue which might be from ge.changwei(a)h3c.com are
queue-4.14/ocfs2-fix-cluster-hang-after-a-node-dies.patch
queue-4.14/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
mm/pagewalk.c: report holes in hugetlb ranges
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: mm/pagewalk.c: report holes in hugetlb ranges
From: Jann Horn <jannh(a)google.com>
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -188,8 +188,12 @@ static int walk_hugetlb_range(unsigned l
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
Patches currently in stable-queue which might be from jannh(a)google.com are
queue-4.14/mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined in this function:
memblock_reserved_memory_within(), which operates over physical
addresses, and returns size in bytes. However, reset_deferred_meminit()
assumes that that this function operates with pfns, and returns page
count.
The result is that in the best case machine boots slower than expected
due to initializing more pages than needed in single thread, and in the
worst case panics because fewer than needed pages are initialized early.
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -700,7 +700,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -290,28 +290,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized durig early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -338,7 +347,7 @@ static inline bool update_defer_init(pg_
if (zone_end < pgdat_end_pfn(pgdat))
return true;
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.14/mm-page_alloc.c-broken-deferred-calculation.patch
This is a note to let you know that I've just added the patch titled
mm/page_ext.c: check if page_ext is not prepared
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: mm/page_ext.c: check if page_ext is not prepared
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
commit e492080e640c2d1235ddf3441cae634cfffef7e1 upstream.
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0. This incurrs invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.
To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
will be checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/page_ext.c | 4 ----
1 file changed, 4 deletions(-)
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -125,7 +125,6 @@ struct page_ext *lookup_page_ext(struct
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -134,7 +133,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (unlikely(!base))
return NULL;
-#endif
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
@@ -199,7 +197,6 @@ struct page_ext *lookup_page_ext(struct
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -208,7 +205,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (!section->page_ext)
return NULL;
-#endif
return get_entry(section->page_ext, pfn);
}
Patches currently in stable-queue which might be from jaewon31.kim(a)samsung.com are
queue-4.14/mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to the overflow in function
check_msg_timeout:
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long.
This patch makes the type consistent.
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4030,7 +4030,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4043,8 +4044,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4110,7 +4111,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.14/ipmi-fix-unsigned-long-underflow.patch
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
we should wait dio requests to finish before inode lock in
ocfs2_setattr(), otherwise the following deadlock will happen:
process 1 process 2 process 3
truncate file 'A' end_io of writing file 'A' receiving the bast messages
ocfs2_setattr
ocfs2_inode_lock_tracker
ocfs2_inode_lock_full
inode_dio_wait
__inode_dio_wait
-->waiting for all dio
requests finish
dlm_proxy_ast_handler
dlm_do_local_bast
ocfs2_blocking_ast
ocfs2_generic_handle_bast
set OCFS2_LOCK_BLOCKED flag
dio_end_io
dio_bio_end_aio
dio_complete
ocfs2_dio_end_io
ocfs2_dio_end_io_write
ocfs2_inode_lock
__ocfs2_cluster_lock
ocfs2_wait_for_mask
-->waiting for OCFS2_LOCK_BLOCKED
flag to be cleared, that is waiting
for 'process 1' unlocking the inode lock
inode_dio_end
-->here dec the i_dio_count, but will never
be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1168,6 +1168,13 @@ int ocfs2_setattr(struct dentry *dentry,
}
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1207,8 +1214,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-4.13/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
rcu: Fix up pending cbs check in rcu_prepare_for_idle
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
rcu-fix-up-pending-cbs-check-in-rcu_prepare_for_idle.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 135bd1a230bb69a68c9808a7d25467318900b80a Mon Sep 17 00:00:00 2001
From: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Date: Mon, 7 Aug 2017 11:20:10 +0530
Subject: rcu: Fix up pending cbs check in rcu_prepare_for_idle
From: Neeraj Upadhyay <neeraju(a)codeaurora.org>
commit 135bd1a230bb69a68c9808a7d25467318900b80a upstream.
The pending-callbacks check in rcu_prepare_for_idle() is backwards.
It should accelerate if there are pending callbacks, but the check
rather uselessly accelerates only if there are no callbacks. This commit
therefore inverts this check.
Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling")
Signed-off-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1493,7 +1493,7 @@ static void rcu_prepare_for_idle(void)
rdtp->last_accelerate = jiffies;
for_each_rcu_flavor(rsp) {
rdp = this_cpu_ptr(rsp->rda);
- if (rcu_segcblist_pend_cbs(&rdp->cblist))
+ if (!rcu_segcblist_pend_cbs(&rdp->cblist))
continue;
rnp = rdp->mynode;
raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
Patches currently in stable-queue which might be from neeraju(a)codeaurora.org are
queue-4.13/rcu-fix-up-pending-cbs-check-in-rcu_prepare_for_idle.patch
This is a note to let you know that I've just added the patch titled
ocfs2: fix cluster hang after a node dies
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-fix-cluster-hang-after-a-node-dies.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 1c01967116a678fed8e2c68a6ab82abc8effeddc Mon Sep 17 00:00:00 2001
From: Changwei Ge <ge.changwei(a)h3c.com>
Date: Wed, 15 Nov 2017 17:31:33 -0800
Subject: ocfs2: fix cluster hang after a node dies
From: Changwei Ge <ge.changwei(a)h3c.com>
commit 1c01967116a678fed8e2c68a6ab82abc8effeddc upstream.
When a node dies, other live nodes have to choose a new master for an
existed lock resource mastered by the dead node.
As for ocfs2/dlm implementation, this is done by function -
dlm_move_lockres_to_recovery_list which marks those lock rsources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
changes lock resource's master later.
So without invoking dlm_move_lockres_to_recovery_list, no master will be
choosed after dlm recovery accomplishment since no lock resource can be
found through ::resource list.
What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for lock
resources mastered a dead node, it will break up synchronization among
nodes.
So invoke dlm_move_lockres_to_recovery_list again.
Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")'
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-…
Signed-off-by: Changwei Ge <ge.changwei(a)h3c.com>
Reported-by: Vitaly Mayatskih <v.mayatskih(a)gmail.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih(a)gmail.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Cc: Joseph Qi <jiangqi903(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/dlm/dlmrecovery.c | 1 +
1 file changed, 1 insertion(+)
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanu
dlm_lockres_put(res);
continue;
}
+ dlm_move_lockres_to_recovery_list(dlm, res);
} else if (res->owner == dlm->node_num) {
dlm_free_dead_locks(dlm, res, dead_node);
__dlm_lockres_calc_usage(dlm, res);
Patches currently in stable-queue which might be from ge.changwei(a)h3c.com are
queue-4.13/ocfs2-fix-cluster-hang-after-a-node-dies.patch
queue-4.13/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
mm/pagewalk.c: report holes in hugetlb ranges
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: mm/pagewalk.c: report holes in hugetlb ranges
From: Jann Horn <jannh(a)google.com>
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -187,8 +187,12 @@ static int walk_hugetlb_range(unsigned l
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
Patches currently in stable-queue which might be from jannh(a)google.com are
queue-4.13/mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
This is a note to let you know that I've just added the patch titled
mm/page_alloc.c: broken deferred calculation
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_alloc.c-broken-deferred-calculation.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d135e5750205a21a212a19dbb05aeb339e2cbea7 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Date: Wed, 15 Nov 2017 17:38:41 -0800
Subject: mm/page_alloc.c: broken deferred calculation
From: Pavel Tatashin <pasha.tatashin(a)oracle.com>
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.
In reset_deferred_meminit() we determine number of pages that must not
be deferred. We initialize pages for at least 2G of memory, but also
pages for reserved memory in this node.
The reserved memory is determined in this function:
memblock_reserved_memory_within(), which operates over physical
addresses, and returns size in bytes. However, reset_deferred_meminit()
assumes that that this function operates with pfns, and returns page
count.
The result is that in the best case machine boots slower than expected
due to initializing more pages than needed in single thread, and in the
worst case panics because fewer than needed pages are initialized early.
Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
Signed-off-by: Pavel Tatashin <pasha.tatashin(a)oracle.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 27 ++++++++++++++++++---------
2 files changed, 20 insertions(+), 10 deletions(-)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -691,7 +691,8 @@ typedef struct pglist_data {
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
- unsigned long static_init_size;
+ /* Number of non-deferred pages */
+ unsigned long static_init_pgcnt;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -289,28 +289,37 @@ EXPORT_SYMBOL(nr_online_nodes);
int page_group_by_mobility_disabled __read_mostly;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+
+/*
+ * Determine how many pages need to be initialized durig early boot
+ * (non-deferred initialization).
+ * The value of first_deferred_pfn will be set later, once non-deferred pages
+ * are initialized, but for now set it ULONG_MAX.
+ */
static inline void reset_deferred_meminit(pg_data_t *pgdat)
{
- unsigned long max_initialise;
- unsigned long reserved_lowmem;
+ phys_addr_t start_addr, end_addr;
+ unsigned long max_pgcnt;
+ unsigned long reserved;
/*
* Initialise at least 2G of a node but also take into account that
* two large system hashes that can take up 1GB for 0.25TB/node.
*/
- max_initialise = max(2UL << (30 - PAGE_SHIFT),
- (pgdat->node_spanned_pages >> 8));
+ max_pgcnt = max(2UL << (30 - PAGE_SHIFT),
+ (pgdat->node_spanned_pages >> 8));
/*
* Compensate the all the memblock reservations (e.g. crash kernel)
* from the initial estimation to make sure we will initialize enough
* memory to boot.
*/
- reserved_lowmem = memblock_reserved_memory_within(pgdat->node_start_pfn,
- pgdat->node_start_pfn + max_initialise);
- max_initialise += reserved_lowmem;
+ start_addr = PFN_PHYS(pgdat->node_start_pfn);
+ end_addr = PFN_PHYS(pgdat->node_start_pfn + max_pgcnt);
+ reserved = memblock_reserved_memory_within(start_addr, end_addr);
+ max_pgcnt += PHYS_PFN(reserved);
- pgdat->static_init_size = min(max_initialise, pgdat->node_spanned_pages);
+ pgdat->static_init_pgcnt = min(max_pgcnt, pgdat->node_spanned_pages);
pgdat->first_deferred_pfn = ULONG_MAX;
}
@@ -337,7 +346,7 @@ static inline bool update_defer_init(pg_
if (zone_end < pgdat_end_pfn(pgdat))
return true;
(*nr_initialised)++;
- if ((*nr_initialised > pgdat->static_init_size) &&
+ if ((*nr_initialised > pgdat->static_init_pgcnt) &&
(pfn & (PAGES_PER_SECTION - 1)) == 0) {
pgdat->first_deferred_pfn = pfn;
return false;
Patches currently in stable-queue which might be from pasha.tatashin(a)oracle.com are
queue-4.13/mm-page_alloc.c-broken-deferred-calculation.patch
This is a note to let you know that I've just added the patch titled
mm/page_ext.c: check if page_ext is not prepared
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: mm/page_ext.c: check if page_ext is not prepared
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
commit e492080e640c2d1235ddf3441cae634cfffef7e1 upstream.
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0. This incurrs invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.
To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
will be checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/page_ext.c | 4 ----
1 file changed, 4 deletions(-)
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -124,7 +124,6 @@ struct page_ext *lookup_page_ext(struct
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -133,7 +132,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (unlikely(!base))
return NULL;
-#endif
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
@@ -198,7 +196,6 @@ struct page_ext *lookup_page_ext(struct
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -207,7 +204,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (!section->page_ext)
return NULL;
-#endif
return get_entry(section->page_ext, pfn);
}
Patches currently in stable-queue which might be from jaewon31.kim(a)samsung.com are
queue-4.13/mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to the overflow in function
check_msg_timeout:
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long.
This patch makes the type consistent.
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4030,7 +4030,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4043,8 +4044,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4110,7 +4111,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.13/ipmi-fix-unsigned-long-underflow.patch
This is a note to let you know that I've just added the patch titled
ocfs2: should wait dio before inode lock in ocfs2_setattr()
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 Mon Sep 17 00:00:00 2001
From: alex chen <alex.chen(a)huawei.com>
Date: Wed, 15 Nov 2017 17:31:40 -0800
Subject: ocfs2: should wait dio before inode lock in ocfs2_setattr()
From: alex chen <alex.chen(a)huawei.com>
commit 28f5a8a7c033cbf3e32277f4cc9c6afd74f05300 upstream.
we should wait dio requests to finish before inode lock in
ocfs2_setattr(), otherwise the following deadlock will happen:
process 1 process 2 process 3
truncate file 'A' end_io of writing file 'A' receiving the bast messages
ocfs2_setattr
ocfs2_inode_lock_tracker
ocfs2_inode_lock_full
inode_dio_wait
__inode_dio_wait
-->waiting for all dio
requests finish
dlm_proxy_ast_handler
dlm_do_local_bast
ocfs2_blocking_ast
ocfs2_generic_handle_bast
set OCFS2_LOCK_BLOCKED flag
dio_end_io
dio_bio_end_aio
dio_complete
ocfs2_dio_end_io
ocfs2_dio_end_io_write
ocfs2_inode_lock
__ocfs2_cluster_lock
ocfs2_wait_for_mask
-->waiting for OCFS2_LOCK_BLOCKED
flag to be cleared, that is waiting
for 'process 1' unlocking the inode lock
inode_dio_end
-->here dec the i_dio_count, but will never
be called, so a deadlock happened.
Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen(a)huawei.com>
Reviewed-by: Jun Piao <piaojun(a)huawei.com>
Reviewed-by: Joseph Qi <jiangqi903(a)gmail.com>
Acked-by: Changwei Ge <ge.changwei(a)h3c.com>
Cc: Mark Fasheh <mfasheh(a)versity.com>
Cc: Joel Becker <jlbec(a)evilplan.org>
Cc: Junxiao Bi <junxiao.bi(a)oracle.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/ocfs2/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1151,6 +1151,13 @@ int ocfs2_setattr(struct dentry *dentry,
dquot_initialize(inode);
size_change = S_ISREG(inode->i_mode) && attr->ia_valid & ATTR_SIZE;
if (size_change) {
+ /*
+ * Here we should wait dio to finish before inode lock
+ * to avoid a deadlock between ocfs2_setattr() and
+ * ocfs2_dio_end_io_write()
+ */
+ inode_dio_wait(inode);
+
status = ocfs2_rw_lock(inode, 1);
if (status < 0) {
mlog_errno(status);
@@ -1170,8 +1177,6 @@ int ocfs2_setattr(struct dentry *dentry,
if (status)
goto bail_unlock;
- inode_dio_wait(inode);
-
if (i_size_read(inode) >= attr->ia_size) {
if (ocfs2_should_order_data(inode)) {
status = ocfs2_begin_ordered_truncate(inode,
Patches currently in stable-queue which might be from alex.chen(a)huawei.com are
queue-3.18/ocfs2-should-wait-dio-before-inode-lock-in-ocfs2_setattr.patch
This is a note to let you know that I've just added the patch titled
ipmi: fix unsigned long underflow
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-fix-unsigned-long-underflow.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 392a17b10ec4320d3c0e96e2a23ebaad1123b989 Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Sat, 29 Jul 2017 21:14:55 -0500
Subject: ipmi: fix unsigned long underflow
From: Corey Minyard <cminyard(a)mvista.com>
commit 392a17b10ec4320d3c0e96e2a23ebaad1123b989 upstream.
When I set the timeout to a specific value such as 500ms, the timeout
event will not happen in time due to the overflow in function
check_msg_timeout:
...
ent->timeout -= timeout_period;
if (ent->timeout > 0)
return;
...
The type of timeout_period is long, but ent->timeout is unsigned long.
This patch makes the type consistent.
Reported-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Tested-by: Weilong Chen <chenweilong(a)huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_msghandler.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
--- a/drivers/char/ipmi/ipmi_msghandler.c
+++ b/drivers/char/ipmi/ipmi_msghandler.c
@@ -4010,7 +4010,8 @@ smi_from_recv_msg(ipmi_smi_t intf, struc
}
static void check_msg_timeout(ipmi_smi_t intf, struct seq_table *ent,
- struct list_head *timeouts, long timeout_period,
+ struct list_head *timeouts,
+ unsigned long timeout_period,
int slot, unsigned long *flags,
unsigned int *waiting_msgs)
{
@@ -4023,8 +4024,8 @@ static void check_msg_timeout(ipmi_smi_t
if (!ent->inuse)
return;
- ent->timeout -= timeout_period;
- if (ent->timeout > 0) {
+ if (timeout_period < ent->timeout) {
+ ent->timeout -= timeout_period;
(*waiting_msgs)++;
return;
}
@@ -4091,7 +4092,8 @@ static void check_msg_timeout(ipmi_smi_t
}
}
-static unsigned int ipmi_timeout_handler(ipmi_smi_t intf, long timeout_period)
+static unsigned int ipmi_timeout_handler(ipmi_smi_t intf,
+ unsigned long timeout_period)
{
struct list_head timeouts;
struct ipmi_recv_msg *msg, *msg2;
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-3.18/ipmi-fix-unsigned-long-underflow.patch
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: [PATCH] mm/page_ext.c: check if page_ext is not prepared
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0. This incurrs invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.
To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
will be checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [depends on f86e427197, see above]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4f0367d472c4..2c16216c29b6 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -125,7 +125,6 @@ struct page_ext *lookup_page_ext(struct page *page)
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -134,7 +133,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (unlikely(!base))
return NULL;
-#endif
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
@@ -199,7 +197,6 @@ struct page_ext *lookup_page_ext(struct page *page)
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -208,7 +205,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (!section->page_ext)
return NULL;
-#endif
return get_entry(section->page_ext, pfn);
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: [PATCH] mm/page_ext.c: check if page_ext is not prepared
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0. This incurrs invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.
To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
will be checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [depends on f86e427197, see above]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4f0367d472c4..2c16216c29b6 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -125,7 +125,6 @@ struct page_ext *lookup_page_ext(struct page *page)
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -134,7 +133,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (unlikely(!base))
return NULL;
-#endif
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
@@ -199,7 +197,6 @@ struct page_ext *lookup_page_ext(struct page *page)
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -208,7 +205,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (!section->page_ext)
return NULL;
-#endif
return get_entry(section->page_ext, pfn);
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: [PATCH] mm/page_ext.c: check if page_ext is not prepared
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse
NULL pointer as value 0. This incurrs invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext. section->page_ext is NULL,
get_entry returned invalid page_ext address as 0x1DFA000 for a PFN
0x13FC00.
To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext
will be checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Cc: <stable(a)vger.kernel.org> [depends on f86e427197, see above]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4f0367d472c4..2c16216c29b6 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -125,7 +125,6 @@ struct page_ext *lookup_page_ext(struct page *page)
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -134,7 +133,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (unlikely(!base))
return NULL;
-#endif
index = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
@@ -199,7 +197,6 @@ struct page_ext *lookup_page_ext(struct page *page)
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#if defined(CONFIG_DEBUG_VM)
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -208,7 +205,6 @@ struct page_ext *lookup_page_ext(struct page *page)
*/
if (!section->page_ext)
return NULL;
-#endif
return get_entry(section->page_ext, pfn);
}
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with i40evf as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Andrew Bowers <andrewx.bowers(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index fe817e2b6fef..50864f99446d 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -179,7 +179,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
i40e_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf);
/* if the descriptor isn't done, no work yet to do */
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with fm10k as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index dbd69310f263..538b42d5c187 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -1231,7 +1231,7 @@ static bool fm10k_clean_tx_irq(struct fm10k_q_vector *q_vector,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->flags & FM10K_TXD_FLAG_DONE))
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with igb as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Aaron Brown <aaron.f.brown(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index e94d3c256667..c208753ff5b7 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -7317,7 +7317,7 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget)
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->wb.status & cpu_to_le32(E1000_TXD_STAT_DD)))
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with igbvf as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Aaron Brown <aaron.f.brown(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/igbvf/netdev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index 713e8df23744..4214c1519a87 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -810,7 +810,7 @@ static bool igbvf_clean_tx_irq(struct igbvf_ring *tx_ring)
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->wb.status & cpu_to_le32(E1000_TXD_STAT_DD)))
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with ixgbevf as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Andrew Bowers <andrewx.bowers(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index feed11bc9ddf..1f4a69134ade 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -326,7 +326,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
The original issue being fixed in this patch was seen with the ixgbe
driver, but the same issue exists with i40e as well, as the code is
very similar. read_barrier_depends is not sufficient to ensure
loads following it are not speculatively loaded out of order
by the CPU, which can result in stale data being loaded, causing
potential system crashes.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Andrew Bowers <andrewx.bowers(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 775d5a125887..4c08cc86463e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3966,7 +3966,7 @@ static bool i40e_clean_fdir_tx_irq(struct i40e_ring *tx_ring, int budget)
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if the descriptor isn't done, no work yet to do */
if (!(eop_desc->cmd_type_offset_bsz &
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index d6d352a6e6ea..4566d66ffc7c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -759,7 +759,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
i40e_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf);
/* we have caught up to head, no work left to do */
--
2.15.0
From: Brian King <brking(a)linux.vnet.ibm.com>
This patch fixes an issue seen on Power systems with ixgbe which results
in skb list corruption and an eventual kernel oops. The following is what
was observed:
CPU 1 CPU2
============================ ============================
1: ixgbe_xmit_frame_ring ixgbe_clean_tx_irq
2: first->skb = skb eop_desc = tx_buffer->next_to_watch
3: ixgbe_tx_map read_barrier_depends()
4: wmb check adapter written status bit
5: first->next_to_watch = tx_desc napi_consume_skb(tx_buffer->skb ..);
6: writel(i, tx_ring->tail);
The read_barrier_depends is insufficient to ensure that tx_buffer->skb does not
get loaded prior to tx_buffer->next_to_watch, which then results in loading
a stale skb pointer. This patch replaces the read_barrier_depends with
smp_rmb to ensure loads are ordered with respect to the load of
tx_buffer->next_to_watch.
Cc: stable <stable(a)vger.kernel.org>
Signed-off-by: Brian King <brking(a)linux.vnet.ibm.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg(a)intel.com>
Tested-by: Andrew Bowers <andrewx.bowers(a)intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher(a)intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ca06c3cc2ca8..62a18914f00f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1192,7 +1192,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
break;
/* prevent any other reads prior to eop_desc */
- read_barrier_depends();
+ smp_rmb();
/* if DD is not set pending work has not been completed */
if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
--
2.15.0
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: [PATCH] mm/pagewalk.c: report holes in hugetlb ranges
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 8bd4afa83cb8..23a3e415ac2c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -188,8 +188,12 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: [PATCH] mm/pagewalk.c: report holes in hugetlb ranges
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 8bd4afa83cb8..23a3e415ac2c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -188,8 +188,12 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask, sz);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
When I added entry_SYSCALL_64_after_hwframe, I left TRACE_IRQS_OFF
before it. This means that users of entry_SYSCALL_64_after_hwframe
were responsible for invoking TRACE_IRQS_OFF, and the one and only
user (added in the same commit) got it wrong.
I think this would manifest as a warning if a Xen PV guest with
CONFIG_DEBUG_LOCKDEP=y were used with context tracking. (The
context tracking bit is to cause lockdep to get invoked before we
turn IRQs back on.) I haven't tested that for real yet because I
can't get a kernel configured like that to boot at all on Xen PV.
I've reported it upstream. The problem seems to be that Xen PV is
missing early #UD handling, is hitting some WARN, and we rely on
Move TRACE_IRQS_OFF below the label.
Cc: stable(a)vger.kernel.org
Cc: Boris Ostrovsky <boris.ostrovsky(a)oracle.com>
Cc: Juergen Gross <jgross(a)suse.com>
Fixes: 8a9949bc71a7 ("x86/xen/64: Rearrange the SYSCALL entries")
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
---
arch/x86/entry/entry_64.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a2b30ec69497..5063ed1214dd 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -148,8 +148,6 @@ ENTRY(entry_SYSCALL_64)
movq %rsp, PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
- TRACE_IRQS_OFF
-
/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(rsp_scratch) /* pt_regs->sp */
@@ -170,6 +168,8 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
sub $(6*8), %rsp /* pt_regs->bp, bx, r12-15 not saved */
UNWIND_HINT_REGS extra=0
+ TRACE_IRQS_OFF
+
/*
* If we need to do entry work or if we guess we'll need to do
* exit work, go straight to the slow path.
--
2.13.6
This patch converts several network drivers to use smp_rmb
rather than read_barrier_depends. The initial issue was
discovered with ixgbe on a Power machine which resulted
in skb list corruption due to fetching a stale skb pointer.
More details can be found in the ixgbe patch description.
Changes since v1:
- Remove NULLing of tx_buffer->skb in the ixgbe patch
Brian King (7):
ixgbe: Fix skb list corruption on Power systems
i40e: Use smp_rmb rather than read_barrier_depends
ixgbevf: Use smp_rmb rather than read_barrier_depends
igbvf: Use smp_rmb rather than read_barrier_depends
igb: Use smp_rmb rather than read_barrier_depends
fm10k: Use smp_rmb rather than read_barrier_depends
i40evf: Use smp_rmb rather than read_barrier_depends
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +-
drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
drivers/net/ethernet/intel/igbvf/netdev.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
8 files changed, 8 insertions(+), 8 deletions(-)
--
1.8.3.1
The patch titled
Subject: IB/core: disable memory registration of fileystem-dax vmas
has been added to the -mm tree. Its filename is
ib-core-disable-memory-registration-of-fileystem-dax-vmas.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/ib-core-disable-memory-registratio…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/ib-core-disable-memory-registratio…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: IB/core: disable memory registration of fileystem-dax vmas
Until there is a solution to the dma-to-dax vs truncate problem it is not
safe to allow RDMA to create long standing memory registrations against
filesytem-dax vmas.
Link: http://lkml.kernel.org/r/151068941011.7446.7766030590347262502.stgit@dwilli…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reported-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Jason Gunthorpe <jgunthorpe(a)obsidianresearch.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/infiniband/core/umem.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN drivers/infiniband/core/umem.c~ib-core-disable-memory-registration-of-fileystem-dax-vmas drivers/infiniband/core/umem.c
--- a/drivers/infiniband/core/umem.c~ib-core-disable-memory-registration-of-fileystem-dax-vmas
+++ a/drivers/infiniband/core/umem.c
@@ -191,7 +191,7 @@ struct ib_umem *ib_umem_get(struct ib_uc
sg_list_start = umem->sg_head.sgl;
while (npages) {
- ret = get_user_pages(cur_base,
+ ret = get_user_pages_longterm(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof (struct page *)),
gup_flags, page_list, vma_list);
_
Patches currently in -mm which might be from dan.j.williams(a)intel.com are
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages.patch
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages-v3.patch
mm-switch-to-define-pmd_write-instead-of-__have_arch_pmd_write.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths-v3.patch
mm-replace-pmd_write-with-pmd_access_permitted-in-fault-gup-paths.patch
mm-replace-pte_write-with-pte_access_permitted-in-fault-gup-paths.patch
mm-hugetlbfs-introduce-split-to-vm_operations_struct.patch
device-dax-implement-split-to-catch-invalid-munmap-attempts.patch
mm-introduce-get_user_pages_longterm.patch
mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch
v4l2-disable-filesystem-dax-mapping-support.patch
ib-core-disable-memory-registration-of-fileystem-dax-vmas.patch
The patch titled
Subject: v4l2: disable filesystem-dax mapping support
has been added to the -mm tree. Its filename is
v4l2-disable-filesystem-dax-mapping-support.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/v4l2-disable-filesystem-dax-mappin…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/v4l2-disable-filesystem-dax-mappin…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: v4l2: disable filesystem-dax mapping support
V4L2 memory registrations are incompatible with filesystem-dax that needs
the ability to revoke dma access to a mapping at will, or otherwise allow
the kernel to wait for completion of DMA. The filesystem-dax
implementation breaks the traditional solution of truncate of active file
backed mappings since there is no page-cache page we can orphan to sustain
ongoing DMA.
If v4l2 wants to support long lived DMA mappings it needs to arrange to
hold a file lease or use some other mechanism so that the kernel can
coordinate revoking DMA access when the filesystem needs to truncate
mappings.
Link: http://lkml.kernel.org/r/151068940499.7446.12846708245365671207.stgit@dwill…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reported-by: Jan Kara <jack(a)suse.cz>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Jason Gunthorpe <jgunthorpe(a)obsidianresearch.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/media/v4l2-core/videobuf-dma-sg.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff -puN drivers/media/v4l2-core/videobuf-dma-sg.c~v4l2-disable-filesystem-dax-mapping-support drivers/media/v4l2-core/videobuf-dma-sg.c
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c~v4l2-disable-filesystem-dax-mapping-support
+++ a/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -185,12 +185,13 @@ static int videobuf_dma_init_user_locked
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
- err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+ err = get_user_pages_longterm(data & PAGE_MASK, dma->nr_pages,
flags, dma->pages, NULL);
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
- dprintk(1, "get_user_pages: err=%d [%d]\n", err, dma->nr_pages);
+ dprintk(1, "get_user_pages_longterm: err=%d [%d]\n", err,
+ dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
return 0;
_
Patches currently in -mm which might be from dan.j.williams(a)intel.com are
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages.patch
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages-v3.patch
mm-switch-to-define-pmd_write-instead-of-__have_arch_pmd_write.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths-v3.patch
mm-replace-pmd_write-with-pmd_access_permitted-in-fault-gup-paths.patch
mm-replace-pte_write-with-pte_access_permitted-in-fault-gup-paths.patch
mm-hugetlbfs-introduce-split-to-vm_operations_struct.patch
device-dax-implement-split-to-catch-invalid-munmap-attempts.patch
mm-introduce-get_user_pages_longterm.patch
mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch
v4l2-disable-filesystem-dax-mapping-support.patch
ib-core-disable-memory-registration-of-fileystem-dax-vmas.patch
The patch titled
Subject: mm: fail get_vaddr_frames() for filesystem-dax mappings
has been added to the -mm tree. Its filename is
mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-fail-get_vaddr_frames-for-files…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-fail-get_vaddr_frames-for-files…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm: fail get_vaddr_frames() for filesystem-dax mappings
Until there is a solution to the dma-to-dax vs truncate problem it is not
safe to allow V4L2, Exynos, and other frame vector users to create long
standing / irrevocable memory registrations against filesytem-dax vmas.
Link: http://lkml.kernel.org/r/151068939985.7446.15684639617389154187.stgit@dwill…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Jason Gunthorpe <jgunthorpe(a)obsidianresearch.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/frame_vector.c | 4 ++++
1 file changed, 4 insertions(+)
diff -puN mm/frame_vector.c~mm-fail-get_vaddr_frames-for-filesystem-dax-mappings mm/frame_vector.c
--- a/mm/frame_vector.c~mm-fail-get_vaddr_frames-for-filesystem-dax-mappings
+++ a/mm/frame_vector.c
@@ -53,6 +53,10 @@ int get_vaddr_frames(unsigned long start
ret = -EFAULT;
goto out;
}
+
+ if (vma_is_fsdax(vma))
+ return -EOPNOTSUPP;
+
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
vec->got_ref = true;
vec->is_pfns = false;
_
Patches currently in -mm which might be from dan.j.williams(a)intel.com are
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages.patch
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages-v3.patch
mm-switch-to-define-pmd_write-instead-of-__have_arch_pmd_write.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths-v3.patch
mm-replace-pmd_write-with-pmd_access_permitted-in-fault-gup-paths.patch
mm-replace-pte_write-with-pte_access_permitted-in-fault-gup-paths.patch
mm-hugetlbfs-introduce-split-to-vm_operations_struct.patch
device-dax-implement-split-to-catch-invalid-munmap-attempts.patch
mm-introduce-get_user_pages_longterm.patch
mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch
v4l2-disable-filesystem-dax-mapping-support.patch
ib-core-disable-memory-registration-of-fileystem-dax-vmas.patch
The patch titled
Subject: mm: introduce get_user_pages_longterm
has been added to the -mm tree. Its filename is
mm-introduce-get_user_pages_longterm.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-introduce-get_user_pages_longte…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-introduce-get_user_pages_longte…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm: introduce get_user_pages_longterm
Patch series "introduce get_user_pages_longterm()", v2.
Here is a new get_user_pages api for cases where a driver intends to keep
an elevated page count indefinitely. This is distinct from usages like
iov_iter_get_pages where the elevated page counts are transient. The
iov_iter_get_pages cases immediately turn around and submit the pages to a
device driver which will put_page when the i/o operation completes (under
kernel control).
In the longterm case userspace is responsible for dropping the page
reference at some undefined point in the future. This is untenable for
filesystem-dax case where the filesystem is in control of the lifetime of
the block / page and needs reasonable limits on how long it can wait for
pages in a mapping to become idle.
Fixing filesystems to actually wait for dax pages to be idle before blocks
from a truncate/hole-punch operation are repurposed is saved for a later
patch series.
Also, allowing longterm registration of dax mappings is a future patch
series that introduces a "map with lease" semantic where the kernel can
revoke a lease and force userspace to drop its page references.
I have also tagged these for -stable to purposely break cases that might
assume that longterm memory registrations for filesystem-dax mappings were
supported by the kernel. The behavior regression this policy change
implies is one of the reasons we maintain the "dax enabled. Warning:
EXPERIMENTAL, use at your own risk" notification when mounting a
filesystem in dax mode.
It is worth noting the device-dax interface does not suffer the same
constraints since it does not support file space management operations
like hole-punch.
This patch (of 4):
Until there is a solution to the dma-to-dax vs truncate problem it is not
safe to allow long standing memory registrations against filesytem-dax
vmas. Device-dax vmas do not have this problem and are explicitly
allowed.
This is temporary until a "memory registration with layout-lease"
mechanism can be implemented for the affected sub-systems (RDMA and V4L2).
Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwill…
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Suggested-by: Christoph Hellwig <hch(a)lst.de>
Cc: Doug Ledford <dledford(a)redhat.com>
Cc: Hal Rosenstock <hal.rosenstock(a)gmail.com>
Cc: Inki Dae <inki.dae(a)samsung.com>
Cc: Jan Kara <jack(a)suse.cz>
Cc: Jason Gunthorpe <jgunthorpe(a)obsidianresearch.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: Joonyoung Shim <jy0922.shim(a)samsung.com>
Cc: Kyungmin Park <kyungmin.park(a)samsung.com>
Cc: Mauro Carvalho Chehab <mchehab(a)kernel.org>
Cc: Mel Gorman <mgorman(a)suse.de>
Cc: Ross Zwisler <ross.zwisler(a)linux.intel.com>
Cc: Sean Hefty <sean.hefty(a)intel.com>
Cc: Seung-Woo Kim <sw0312.kim(a)samsung.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/fs.h | 14 +++++++++
include/linux/mm.h | 13 ++++++++
mm/gup.c | 64 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 91 insertions(+)
diff -puN include/linux/fs.h~mm-introduce-get_user_pages_longterm include/linux/fs.h
--- a/include/linux/fs.h~mm-introduce-get_user_pages_longterm
+++ a/include/linux/fs.h
@@ -3194,6 +3194,20 @@ static inline bool vma_is_dax(struct vm_
return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
}
+static inline bool vma_is_fsdax(struct vm_area_struct *vma)
+{
+ struct inode *inode;
+
+ if (!vma->vm_file)
+ return false;
+ if (!vma_is_dax(vma))
+ return false;
+ inode = file_inode(vma->vm_file);
+ if (inode->i_mode == S_IFCHR)
+ return false; /* device-dax */
+ return true;
+}
+
static inline int iocb_flags(struct file *file)
{
int res = 0;
diff -puN include/linux/mm.h~mm-introduce-get_user_pages_longterm include/linux/mm.h
--- a/include/linux/mm.h~mm-introduce-get_user_pages_longterm
+++ a/include/linux/mm.h
@@ -1380,6 +1380,19 @@ long get_user_pages_locked(unsigned long
unsigned int gup_flags, struct page **pages, int *locked);
long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);
+#ifdef CONFIG_FS_DAX
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+ unsigned int gup_flags, struct page **pages,
+ struct vm_area_struct **vmas);
+#else
+static inline long get_user_pages_longterm(unsigned long start,
+ unsigned long nr_pages, unsigned int gup_flags,
+ struct page **pages, struct vm_area_struct **vmas)
+{
+ return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+#endif /* CONFIG_FS_DAX */
+
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
diff -puN mm/gup.c~mm-introduce-get_user_pages_longterm mm/gup.c
--- a/mm/gup.c~mm-introduce-get_user_pages_longterm
+++ a/mm/gup.c
@@ -1095,6 +1095,70 @@ long get_user_pages(unsigned long start,
}
EXPORT_SYMBOL(get_user_pages);
+#ifdef CONFIG_FS_DAX
+/*
+ * This is the same as get_user_pages() in that it assumes we are
+ * operating on the current task's mm, but it goes further to validate
+ * that the vmas associated with the address range are suitable for
+ * longterm elevated page reference counts. For example, filesystem-dax
+ * mappings are subject to the lifetime enforced by the filesystem and
+ * we need guarantees that longterm users like RDMA and V4L2 only
+ * establish mappings that have a kernel enforced revocation mechanism.
+ *
+ * "longterm" == userspace controlled elevated page count lifetime.
+ * Contrast this to iov_iter_get_pages() usages which are transient.
+ */
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+ unsigned int gup_flags, struct page **pages,
+ struct vm_area_struct **vmas_arg)
+{
+ struct vm_area_struct **vmas = vmas_arg;
+ struct vm_area_struct *vma_prev = NULL;
+ long rc, i;
+
+ if (!pages)
+ return -EINVAL;
+
+ if (!vmas) {
+ vmas = kzalloc(sizeof(struct vm_area_struct *) * nr_pages,
+ GFP_KERNEL);
+ if (!vmas)
+ return -ENOMEM;
+ }
+
+ rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+
+ for (i = 0; i < rc; i++) {
+ struct vm_area_struct *vma = vmas[i];
+
+ if (vma == vma_prev)
+ continue;
+
+ vma_prev = vma;
+
+ if (vma_is_fsdax(vma))
+ break;
+ }
+
+ /*
+ * Either get_user_pages() failed, or the vma validation
+ * succeeded, in either case we don't need to put_page() before
+ * returning.
+ */
+ if (i >= rc)
+ goto out;
+
+ for (i = 0; i < rc; i++)
+ put_page(pages[i]);
+ rc = -EOPNOTSUPP;
+out:
+ if (vmas != vmas_arg)
+ kfree(vmas);
+ return rc;
+}
+EXPORT_SYMBOL(get_user_pages_longterm);
+#endif /* CONFIG_FS_DAX */
+
/**
* populate_vma_page_range() - populate a range of pages in the vma.
* @vma: target vma
_
Patches currently in -mm which might be from dan.j.williams(a)intel.com are
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages.patch
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages-v3.patch
mm-switch-to-define-pmd_write-instead-of-__have_arch_pmd_write.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths-v3.patch
mm-replace-pmd_write-with-pmd_access_permitted-in-fault-gup-paths.patch
mm-replace-pte_write-with-pte_access_permitted-in-fault-gup-paths.patch
mm-hugetlbfs-introduce-split-to-vm_operations_struct.patch
device-dax-implement-split-to-catch-invalid-munmap-attempts.patch
mm-introduce-get_user_pages_longterm.patch
mm-fail-get_vaddr_frames-for-filesystem-dax-mappings.patch
v4l2-disable-filesystem-dax-mapping-support.patch
ib-core-disable-memory-registration-of-fileystem-dax-vmas.patch
The patch titled
Subject: device-dax: implement ->split() to catch invalid munmap attempts
has been added to the -mm tree. Its filename is
device-dax-implement-split-to-catch-invalid-munmap-attempts.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/device-dax-implement-split-to-catc…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/device-dax-implement-split-to-catc…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: device-dax: implement ->split() to catch invalid munmap attempts
Similar to how device-dax enforces that the 'address', 'offset', and 'len'
parameters to mmap() be aligned to the device's fundamental alignment, the
same constraints apply to munmap(). Implement ->split() to fail munmap
calls that violate the alignment constraint. Otherwise, we later fail
VM_BUG_ON checks in the unmap_page_range() path with crash signatures of
the form:
vma ffff8800b60c8a88 start 00007f88c0000000 end 00007f88c0e00000
next (null) prev (null) mm ffff8800b61150c0
prot 8000000000000027 anon_vma (null) vm_ops ffffffffa0091240
pgoff 0 file ffff8800b638ef80 private_data (null)
flags: 0x380000fb(read|write|shared|mayread|maywrite|mayexec|mayshare|softdirty|mixedmap|hugepage)
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:2014!
[..]
RIP: 0010:__split_huge_pud+0x12a/0x180
[..]
Call Trace:
unmap_page_range+0x245/0xa40
? __vma_adjust+0x301/0x990
unmap_vmas+0x4c/0xa0
unmap_region+0xae/0x120
? __vma_rb_erase+0x11a/0x230
do_munmap+0x276/0x410
vm_munmap+0x6a/0xa0
SyS_munmap+0x1d/0x30
Link: http://lkml.kernel.org/r/151130418681.4029.7118245855057952010.stgit@dwilli…
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Reported-by: Jeff Moyer <jmoyer(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
drivers/dax/device.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff -puN drivers/dax/device.c~device-dax-implement-split-to-catch-invalid-munmap-attempts drivers/dax/device.c
--- a/drivers/dax/device.c~device-dax-implement-split-to-catch-invalid-munmap-attempts
+++ a/drivers/dax/device.c
@@ -428,9 +428,21 @@ static int dev_dax_fault(struct vm_fault
return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
}
+static int dev_dax_split(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct file *filp = vma->vm_file;
+ struct dev_dax *dev_dax = filp->private_data;
+ struct dax_region *dax_region = dev_dax->region;
+
+ if (!IS_ALIGNED(addr, dax_region->align))
+ return -EINVAL;
+ return 0;
+}
+
static const struct vm_operations_struct dax_vm_ops = {
.fault = dev_dax_fault,
.huge_fault = dev_dax_huge_fault,
+ .split = dev_dax_split,
};
static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
_
Patches currently in -mm which might be from dan.j.williams(a)intel.com are
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages.patch
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages-v3.patch
mm-switch-to-define-pmd_write-instead-of-__have_arch_pmd_write.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths-v3.patch
mm-replace-pmd_write-with-pmd_access_permitted-in-fault-gup-paths.patch
mm-replace-pte_write-with-pte_access_permitted-in-fault-gup-paths.patch
mm-hugetlbfs-introduce-split-to-vm_operations_struct.patch
device-dax-implement-split-to-catch-invalid-munmap-attempts.patch
The patch titled
Subject: mm, hugetlbfs: introduce ->split() to vm_operations_struct
has been added to the -mm tree. Its filename is
mm-hugetlbfs-introduce-split-to-vm_operations_struct.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlbfs-introduce-split-to-vm…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlbfs-introduce-split-to-vm…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Dan Williams <dan.j.williams(a)intel.com>
Subject: mm, hugetlbfs: introduce ->split() to vm_operations_struct
Patch series "device-dax: fix unaligned munmap handling"
When device-dax is operating in huge-page mode we want it to behave like
hugetlbfs and fail attempts to split vmas into unaligned ranges. It would
be messy to teach the munmap path about device-dax alignment constraints
in the same (hstate) way that hugetlbfs communicates this constraint.
Instead, these patches introduce a new ->split() vm operation.
This patch (of 2):
The device-dax interface has similar constraints as hugetlbfs in that it
requires the munmap path to unmap in huge page aligned units. Rather than
add more custom vma handling code in __split_vma() introduce a new vm
operation to perform this vma specific check.
Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwilli…
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
Cc: Jeff Moyer <jmoyer(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/mm.h | 1 +
mm/hugetlb.c | 8 ++++++++
mm/mmap.c | 8 +++++---
3 files changed, 14 insertions(+), 3 deletions(-)
diff -puN include/linux/mm.h~mm-hugetlbfs-introduce-split-to-vm_operations_struct include/linux/mm.h
--- a/include/linux/mm.h~mm-hugetlbfs-introduce-split-to-vm_operations_struct
+++ a/include/linux/mm.h
@@ -377,6 +377,7 @@ enum page_entry_size {
struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
+ int (*split)(struct vm_area_struct * area, unsigned long addr);
int (*mremap)(struct vm_area_struct * area);
int (*fault)(struct vm_fault *vmf);
int (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size);
diff -puN mm/hugetlb.c~mm-hugetlbfs-introduce-split-to-vm_operations_struct mm/hugetlb.c
--- a/mm/hugetlb.c~mm-hugetlbfs-introduce-split-to-vm_operations_struct
+++ a/mm/hugetlb.c
@@ -3125,6 +3125,13 @@ static void hugetlb_vm_op_close(struct v
}
}
+static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
+{
+ if (addr & ~(huge_page_mask(hstate_vma(vma))))
+ return -EINVAL;
+ return 0;
+}
+
/*
* We cannot handle pagefaults against hugetlb pages at all. They cause
* handle_mm_fault() to try to instantiate regular-sized pages in the
@@ -3141,6 +3148,7 @@ const struct vm_operations_struct hugetl
.fault = hugetlb_vm_op_fault,
.open = hugetlb_vm_op_open,
.close = hugetlb_vm_op_close,
+ .split = hugetlb_vm_op_split,
};
static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
diff -puN mm/mmap.c~mm-hugetlbfs-introduce-split-to-vm_operations_struct mm/mmap.c
--- a/mm/mmap.c~mm-hugetlbfs-introduce-split-to-vm_operations_struct
+++ a/mm/mmap.c
@@ -2555,9 +2555,11 @@ int __split_vma(struct mm_struct *mm, st
struct vm_area_struct *new;
int err;
- if (is_vm_hugetlb_page(vma) && (addr &
- ~(huge_page_mask(hstate_vma(vma)))))
- return -EINVAL;
+ if (vma->vm_ops && vma->vm_ops->split) {
+ err = vma->vm_ops->split(vma, addr);
+ if (err)
+ return err;
+ }
new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
if (!new)
_
Patches currently in -mm which might be from dan.j.williams(a)intel.com are
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages.patch
mm-fix-device-dax-pud-write-faults-triggered-by-get_user_pages-v3.patch
mm-switch-to-define-pmd_write-instead-of-__have_arch_pmd_write.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths.patch
mm-replace-pud_write-with-pud_access_permitted-in-fault-gup-paths-v3.patch
mm-replace-pmd_write-with-pmd_access_permitted-in-fault-gup-paths.patch
mm-replace-pte_write-with-pte_access_permitted-in-fault-gup-paths.patch
mm-hugetlbfs-introduce-split-to-vm_operations_struct.patch
device-dax-implement-split-to-catch-invalid-munmap-attempts.patch
Hi Andrew,
Here is another device-dax fix that requires touching some mm code. When
device-dax is operating in huge-page mode we want it to behave like
hugetlbfs and fail attempts to split vmas into unaligned ranges. It
would be messy to teach the munmap path about device-dax alignment
constraints in the same (hstate) way that hugetlbfs communicates this
constraint. Instead, these patches introduce a new ->split() vm
operation.
---
Dan Williams (2):
mm, hugetlbfs: introduce ->split() to vm_operations_struct
device-dax: implement ->split() to catch invalid munmap attempts
drivers/dax/device.c | 12 ++++++++++++
include/linux/mm.h | 1 +
mm/hugetlb.c | 8 ++++++++
mm/mmap.c | 8 +++++---
4 files changed, 26 insertions(+), 3 deletions(-)
The patch titled
Subject: mm: migrate: fix an incorrect call of prep_transhuge_page()
has been added to the -mm tree. Its filename is
mm-migrate-fix-an-incorrect-call-of-prep_transhuge_page.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-migrate-fix-an-incorrect-call-o…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-migrate-fix-an-incorrect-call-o…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Zi Yan <zi.yan(a)cs.rutgers.edu>
Subject: mm: migrate: fix an incorrect call of prep_transhuge_page()
In https://lkml.org/lkml/2017/11/20/411, Andrea reported that during
memory hotplug/hot remove prep_transhuge_page() is called incorrectly on
non-THP pages for migration, when THP is on but THP migration is not
enabled. This leads to a bad state of target pages for migration.
This patch fixes it by only calling prep_transhuge_page() when we are
certain that the target page is THP.
Link: http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com
Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration")
Signed-off-by: Zi Yan <zi.yan(a)cs.rutgers.edu>
Reported-by: Andrea Reale <ar(a)linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
Cc: <stable(a)vger.kernel.org> [4.14]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/migrate.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN include/linux/migrate.h~mm-migrate-fix-an-incorrect-call-of-prep_transhuge_page include/linux/migrate.h
--- a/include/linux/migrate.h~mm-migrate-fix-an-incorrect-call-of-prep_transhuge_page
+++ a/include/linux/migrate.h
@@ -54,7 +54,7 @@ static inline struct page *new_page_node
new_page = __alloc_pages_nodemask(gfp_mask, order,
preferred_nid, nodemask);
- if (new_page && PageTransHuge(page))
+ if (new_page && PageTransHuge(new_page))
prep_transhuge_page(new_page);
return new_page;
_
Patches currently in -mm which might be from zi.yan(a)cs.rutgers.edu are
mm-migrate-fix-an-incorrect-call-of-prep_transhuge_page.patch
From: Jeff Mahoney <jeffm(a)suse.com>
Since commit fb235dc06fa (btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans) the assumption that
btrfs_add_delayed_{data,tree}_ref can only return 0 or -ENOMEM has
been false. The qgroup operations call into btrfs_search_slot
and friends and can now return the full spectrum of error codes.
Fortunately, the fix here is easy since update_ref_for_cow failing
is already handled so we just need to bail early with the error
code.
Fixes: fb235dc06fa (btrfs: qgroup: Move half of the qgroup accounting ...)
Cc: <stable(a)vger.kernel.org> # v4.11+
Signed-off-by: Jeff Mahoney <jeffm(a)suse.com>
---
fs/btrfs/ctree.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 531e0a8645b0..1e74cf826532 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1032,14 +1032,17 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) &&
!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)) {
ret = btrfs_inc_ref(trans, root, buf, 1);
- BUG_ON(ret); /* -ENOMEM */
+ if (ret)
+ return ret;
if (root->root_key.objectid ==
BTRFS_TREE_RELOC_OBJECTID) {
ret = btrfs_dec_ref(trans, root, buf, 0);
- BUG_ON(ret); /* -ENOMEM */
+ if (ret)
+ return ret;
ret = btrfs_inc_ref(trans, root, cow, 1);
- BUG_ON(ret); /* -ENOMEM */
+ if (ret)
+ return ret;
}
new_flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
} else {
@@ -1049,7 +1052,8 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
ret = btrfs_inc_ref(trans, root, cow, 1);
else
ret = btrfs_inc_ref(trans, root, cow, 0);
- BUG_ON(ret); /* -ENOMEM */
+ if (ret)
+ return ret;
}
if (new_flags != 0) {
int level = btrfs_header_level(buf);
@@ -1068,9 +1072,11 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
ret = btrfs_inc_ref(trans, root, cow, 1);
else
ret = btrfs_inc_ref(trans, root, cow, 0);
- BUG_ON(ret); /* -ENOMEM */
+ if (ret)
+ return ret;
ret = btrfs_dec_ref(trans, root, buf, 1);
- BUG_ON(ret); /* -ENOMEM */
+ if (ret)
+ return ret;
}
clean_tree_block(fs_info, buf);
*last_ref = 1;
--
2.14.2
Changes since v2 [1]:
* Switch from the "#define __HAVE_ARCH_PUD_WRITE" to "#define
pud_write". This incidentally fixes a powerpc compile error.
(Stephen)
* Add a cleanup patch to align pmd_write to the pud_write definition
scheme.
---
Andrew,
Here is another attempt at the pud_write() fix [2], and some follow-on
patches to use the '_access_permitted' helpers in fault and
get_user_pages() paths where we are checking if the thread has access to
write. I explicitly omit conversions for places where the kernel is
checking the _PAGE_RW flag for kernel purposes, not for userspace
access.
Beyond fixing the crash, this series also fixes get_user_pages() and
fault paths to honor protection keys in the same manner as
get_user_pages_fast(). Only the crash fix is tagged for -stable as the
protection key check is done just for consistency reasons since
userspace can change protection keys at will.
These have received a build success notification from the 0day robot.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-November/013254.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-November/013237.html
---
Dan Williams (5):
mm: fix device-dax pud write-faults triggered by get_user_pages()
mm: switch to 'define pmd_write' instead of __HAVE_ARCH_PMD_WRITE
mm: replace pud_write with pud_access_permitted in fault + gup paths
mm: replace pmd_write with pmd_access_permitted in fault + gup paths
mm: replace pte_write with pte_access_permitted in fault + gup paths
arch/arm/include/asm/pgtable-3level.h | 1 -
arch/arm64/include/asm/pgtable.h | 1 -
arch/mips/include/asm/pgtable.h | 2 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 1 -
arch/s390/include/asm/pgtable.h | 8 +++++++-
arch/sparc/include/asm/pgtable_64.h | 2 +-
arch/sparc/mm/gup.c | 4 ++--
arch/tile/include/asm/pgtable.h | 1 -
arch/x86/include/asm/pgtable.h | 8 +++++++-
fs/dax.c | 3 ++-
include/asm-generic/pgtable.h | 12 ++++++++++--
include/linux/hugetlb.h | 8 --------
mm/gup.c | 2 +-
mm/hmm.c | 8 ++++----
mm/huge_memory.c | 6 +++---
mm/memory.c | 8 ++++----
16 files changed, 42 insertions(+), 33 deletions(-)
On 11/21/2017 11:48 AM, Leon Romanovsky wrote:
> On Tue, Nov 21, 2017 at 09:36:48AM -0700, Jason Gunthorpe wrote:
>> On Tue, Nov 21, 2017 at 06:34:54PM +0200, Leon Romanovsky wrote:
>>> On Tue, Nov 21, 2017 at 09:04:42AM -0700, Jason Gunthorpe wrote:
>>>> On Tue, Nov 21, 2017 at 09:37:27AM -0600, Daniel Jurgens wrote:
>>>>
>>>>> The only warning that would make sense is if the mixed ports aren't
>>>>> all IB or RoCE. As you note, CX-3 can mix those two, we don't want to
>>>>> see warnings about that.
>>>>
>>>> I would really like to see cx3 be changed to not do that, then we
>>>> could finalize this issue upstream: All device ports must be the same
>>>> protocol.
>>>
>>> I don't see the point of such artificial limitation, the users who
>>> brought CX-3 have option to work in mixed mode and IMHO it is not right
>>> to deprecate such ability just because it is hard for us to code for it.
>>
>> I don't really think it is really too user visible.. Only the device
>> and port number change, but only if running in mixed mode.
>
> Ahh, correct me if I'm wrong, you are proposing to split mlx4_ib devices
> to two devices once it is configured in mixed mode, so everyone will
> have one port only. Did I understand you correctly?
>
Which would require splitting resources that are shared now? other splitting issue(s)?
>>
>> It is not just 'hard for us' it is impossible to reconcile the
>> differences between ports when enforcing device level things.
>>
>> This keeps coming up again and again..
>>
>> Jason
On 11/21/2017 11:34 AM, Leon Romanovsky wrote:
> On Tue, Nov 21, 2017 at 09:04:42AM -0700, Jason Gunthorpe wrote:
>> On Tue, Nov 21, 2017 at 09:37:27AM -0600, Daniel Jurgens wrote:
>>
>>> The only warning that would make sense is if the mixed ports aren't
>>> all IB or RoCE. As you note, CX-3 can mix those two, we don't want to
>>> see warnings about that.
>>
>> I would really like to see cx3 be changed to not do that, then we
>> could finalize this issue upstream: All device ports must be the same
>> protocol.
>
> I don't see the point of such artificial limitation, the users who
> brought CX-3 have option to work in mixed mode and IMHO it is not right
> to deprecate such ability just because it is hard for us to code for it.
>
We use cx-3 (& cx-4 & cx-5) in mixed mode all over our RDMA cluster.
Loosing that feature is not an option.
> Thanks
>
>>
>> Jason
On Tue, Nov 21, 2017 at 06:48:02PM +0200, Leon Romanovsky wrote:
> On Tue, Nov 21, 2017 at 09:36:48AM -0700, Jason Gunthorpe wrote:
> Ahh, correct me if I'm wrong, you are proposing to split mlx4_ib devices
> to two devices once it is configured in mixed mode, so everyone will
> have one port only. Did I understand you correctly?
Right, but only for mixed mode. dual port roce or dual port ib is not
altered.
Jason
From: Adam Wallis <awallis(a)codeaurora.org>
commit a9df21e34b422f79d9a9fa5c3eff8c2a53491be6 upstream.
This patch was backported and only needed a line adjustment.
Commit adfa543e7314 ("dmatest: don't use set_freezable_with_signal()")
introduced a bug (that is in fact documented by the patch commit text)
that leaves behind a dangling pointer. Since the done_wait structure is
allocated on the stack, future invocations to the DMATEST can produce
undesirable results (e.g., corrupted spinlocks). Ideally, this would be
cleaned up in the thread handler, but at the very least, the kernel
is left in a very precarious scenario that can lead to some long debug
sessions when the crash comes later.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=197605
Signed-off-by: Adam Wallis <awallis(a)codeaurora.org>
Signed-off-by: Vinod Koul <vinod.koul(a)intel.com>
---
drivers/dma/dmatest.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/dma/dmatest.c b/drivers/dma/dmatest.c
index cf76fc6..fbb7551 100644
--- a/drivers/dma/dmatest.c
+++ b/drivers/dma/dmatest.c
@@ -666,6 +666,7 @@ static int dmatest_func(void *data)
* free it this time?" dancing. For now, just
* leave it dangling.
*/
+ WARN(1, "dmatest: Kernel stack may be corrupted!!\n");
dmaengine_unmap_put(um);
result("test timed out", total_tests, src_off, dst_off,
len, 0);
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.
This is a note to let you know that I've just added the patch titled
serial: omap: Fix EFR write on RTS deassertion
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
serial-omap-fix-efr-write-on-rts-deassertion.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 2a71de2f7366fb1aec632116d0549ec56d6a3940 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Sat, 21 Oct 2017 10:50:18 +0200
Subject: serial: omap: Fix EFR write on RTS deassertion
From: Lukas Wunner <lukas(a)wunner.de>
commit 2a71de2f7366fb1aec632116d0549ec56d6a3940 upstream.
Commit 348f9bb31c56 ("serial: omap: Fix RTS handling") sought to enable
auto RTS upon manual RTS assertion and disable it on deassertion.
However it seems the latter was done incorrectly, it clears all bits in
the Extended Features Register *except* auto RTS.
Fixes: 348f9bb31c56 ("serial: omap: Fix RTS handling")
Cc: Peter Hurley <peter(a)hurleysoftware.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/serial/omap-serial.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/tty/serial/omap-serial.c
+++ b/drivers/tty/serial/omap-serial.c
@@ -693,7 +693,7 @@ static void serial_omap_set_mctrl(struct
if ((mctrl & TIOCM_RTS) && (port->status & UPSTAT_AUTORTS))
up->efr |= UART_EFR_RTS;
else
- up->efr &= UART_EFR_RTS;
+ up->efr &= ~UART_EFR_RTS;
serial_out(up, UART_EFR, up->efr);
serial_out(up, UART_LCR, lcr);
Patches currently in stable-queue which might be from lukas(a)wunner.de are
queue-4.9/serial-omap-fix-efr-write-on-rts-deassertion.patch
This is a note to let you know that I've just added the patch titled
serial: 8250_fintek: Fix finding base_port with activated SuperIO
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
serial-8250_fintek-fix-finding-base_port-with-activated-superio.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From fd97e66c5529046e989a0879c3bb58fddb592c71 Mon Sep 17 00:00:00 2001
From: "Ji-Ze Hong (Peter Hong)" <hpeter(a)gmail.com>
Date: Tue, 17 Oct 2017 14:23:08 +0800
Subject: serial: 8250_fintek: Fix finding base_port with activated SuperIO
From: Ji-Ze Hong (Peter Hong) <hpeter(a)gmail.com>
commit fd97e66c5529046e989a0879c3bb58fddb592c71 upstream.
The SuperIO will be configured at boot time by BIOS, but some BIOS
will not deactivate the SuperIO when the end of configuration. It'll
lead to mismatch for pdata->base_port in probe_setup_port(). So we'll
deactivate all SuperIO before activate special base_port in
fintek_8250_enter_key().
Tested on iBASE MI802.
Tested-by: Ji-Ze Hong (Peter Hong) <hpeter+linux_kernel(a)gmail.com>
Signed-off-by: Ji-Ze Hong (Peter Hong) <hpeter+linux_kernel(a)gmail.com>
Reviewd-by: Alan Cox <alan(a)linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/serial/8250/8250_fintek.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/tty/serial/8250/8250_fintek.c
+++ b/drivers/tty/serial/8250/8250_fintek.c
@@ -54,6 +54,9 @@ static int fintek_8250_enter_key(u16 bas
if (!request_muxed_region(base_port, 2, "8250_fintek"))
return -EBUSY;
+ /* Force to deactive all SuperIO in this base_port */
+ outb(EXIT_KEY, base_port + ADDR_PORT);
+
outb(key, base_port + ADDR_PORT);
outb(key, base_port + ADDR_PORT);
return 0;
Patches currently in stable-queue which might be from hpeter(a)gmail.com are
queue-4.9/serial-8250_fintek-fix-finding-base_port-with-activated-superio.patch
This is a note to let you know that I've just added the patch titled
crypto: dh - fix memleak in setkey
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
crypto-dh-fix-memleak-in-setkey.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From ee34e2644a78e2561742bea8c4bdcf83cabf90a7 Mon Sep 17 00:00:00 2001
From: Tudor-Dan Ambarus <tudor.ambarus(a)microchip.com>
Date: Thu, 25 May 2017 10:18:07 +0300
Subject: crypto: dh - fix memleak in setkey
From: Tudor-Dan Ambarus <tudor.ambarus(a)microchip.com>
commit ee34e2644a78e2561742bea8c4bdcf83cabf90a7 upstream.
setkey can be called multiple times during the existence
of the transformation object. In case of multiple setkey calls,
the old key was not freed and we leaked memory.
Free the old MPI key if any.
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Signed-off-by: Herbert Xu <herbert(a)gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
crypto/dh.c | 3 +++
1 file changed, 3 insertions(+)
--- a/crypto/dh.c
+++ b/crypto/dh.c
@@ -84,6 +84,9 @@ static int dh_set_secret(struct crypto_k
struct dh_ctx *ctx = dh_get_ctx(tfm);
struct dh params;
+ /* Free the old MPI key if any */
+ dh_free_ctx(ctx);
+
if (crypto_dh_decode_key(buf, len, ¶ms) < 0)
return -EINVAL;
Patches currently in stable-queue which might be from tudor.ambarus(a)microchip.com are
queue-4.9/crypto-dh-fix-memleak-in-setkey.patch
queue-4.9/crypto-dh-fix-double-free-of-ctx-p.patch
This is a note to let you know that I've just added the patch titled
ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb Mon Sep 17 00:00:00 2001
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Date: Tue, 7 Nov 2017 11:37:07 +0100
Subject: ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
From: Roberto Sassu <roberto.sassu(a)huawei.com>
commit 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb upstream.
Commit b65a9cfc2c38 ("Untangling ima mess, part 2: deal with counters")
moved the call of ima_file_check() from may_open() to do_filp_open() at a
point where the file descriptor is already opened.
This breaks the assumption made by IMA that file descriptors being closed
belong to files whose access was granted by ima_file_check(). The
consequence is that security.ima and security.evm are updated with good
values, regardless of the current appraisal status.
For example, if a file does not have security.ima, IMA will create it after
opening the file for writing, even if access is denied. Access to the file
will be allowed afterwards.
Avoid this issue by checking the appraisal status before updating
security.ima.
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Signed-off-by: Mimi Zohar <zohar(a)linux.vnet.ibm.com>
Signed-off-by: James Morris <james.l.morris(a)oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
security/integrity/ima/ima_appraise.c | 3 +++
1 file changed, 3 insertions(+)
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -303,6 +303,9 @@ void ima_update_xattr(struct integrity_i
if (iint->flags & IMA_DIGSIG)
return;
+ if (iint->ima_file_status != INTEGRITY_PASS)
+ return;
+
rc = ima_collect_measurement(iint, file, NULL, 0, ima_hash_algo);
if (rc < 0)
return;
Patches currently in stable-queue which might be from roberto.sassu(a)huawei.com are
queue-4.9/ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
This is a note to let you know that I've just added the patch titled
serial: omap: Fix EFR write on RTS deassertion
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
serial-omap-fix-efr-write-on-rts-deassertion.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 2a71de2f7366fb1aec632116d0549ec56d6a3940 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Sat, 21 Oct 2017 10:50:18 +0200
Subject: serial: omap: Fix EFR write on RTS deassertion
From: Lukas Wunner <lukas(a)wunner.de>
commit 2a71de2f7366fb1aec632116d0549ec56d6a3940 upstream.
Commit 348f9bb31c56 ("serial: omap: Fix RTS handling") sought to enable
auto RTS upon manual RTS assertion and disable it on deassertion.
However it seems the latter was done incorrectly, it clears all bits in
the Extended Features Register *except* auto RTS.
Fixes: 348f9bb31c56 ("serial: omap: Fix RTS handling")
Cc: Peter Hurley <peter(a)hurleysoftware.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/serial/omap-serial.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/tty/serial/omap-serial.c
+++ b/drivers/tty/serial/omap-serial.c
@@ -693,7 +693,7 @@ static void serial_omap_set_mctrl(struct
if ((mctrl & TIOCM_RTS) && (port->status & UPSTAT_AUTORTS))
up->efr |= UART_EFR_RTS;
else
- up->efr &= UART_EFR_RTS;
+ up->efr &= ~UART_EFR_RTS;
serial_out(up, UART_EFR, up->efr);
serial_out(up, UART_LCR, lcr);
Patches currently in stable-queue which might be from lukas(a)wunner.de are
queue-4.4/serial-omap-fix-efr-write-on-rts-deassertion.patch
This is a note to let you know that I've just added the patch titled
arm64: fix dump_instr when PAN and UAO are in use
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
arm64-fix-dump_instr-when-pan-and-uao-are-in-use.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From c5cea06be060f38e5400d796e61cfc8c36e52924 Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland(a)arm.com>
Date: Mon, 13 Jun 2016 11:15:14 +0100
Subject: arm64: fix dump_instr when PAN and UAO are in use
From: Mark Rutland <mark.rutland(a)arm.com>
commit c5cea06be060f38e5400d796e61cfc8c36e52924 upstream.
If the kernel is set to show unhandled signals, and a user task does not
handle a SIGILL as a result of an instruction abort, we will attempt to
log the offending instruction with dump_instr before killing the task.
We use dump_instr to log the encoding of the offending userspace
instruction. However, dump_instr is also used to dump instructions from
kernel space, and internally always switches to KERNEL_DS before dumping
the instruction with get_user. When both PAN and UAO are in use, reading
a user instruction via get_user while in KERNEL_DS will result in a
permission fault, which leads to an Oops.
As we have regs corresponding to the context of the original instruction
abort, we can inspect this and only flip to KERNEL_DS if the original
abort was taken from the kernel, avoiding this issue. At the same time,
remove the redundant (and incorrect) comments regarding the order
dump_mem and dump_instr are called in.
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: James Morse <james.morse(a)arm.com>
Cc: Robin Murphy <robin.murphy(a)arm.com>
Signed-off-by: Mark Rutland <mark.rutland(a)arm.com>
Reported-by: Vladimir Murzin <vladimir.murzin(a)arm.com>
Tested-by: Vladimir Murzin <vladimir.murzin(a)arm.com>
Fixes: 57f4959bad0a154a ("arm64: kernel: Add support for User Access Override")
Signed-off-by: Will Deacon <will.deacon(a)arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/arm64/kernel/traps.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -64,8 +64,7 @@ static void dump_mem(const char *lvl, co
/*
* We need to switch to kernel mode so that we can use __get_user
- * to safely read from kernel space. Note that we now dump the
- * code first, just in case the backtrace kills us.
+ * to safely read from kernel space.
*/
fs = get_fs();
set_fs(KERNEL_DS);
@@ -111,21 +110,12 @@ static void dump_backtrace_entry(unsigne
print_ip_sym(where);
}
-static void dump_instr(const char *lvl, struct pt_regs *regs)
+static void __dump_instr(const char *lvl, struct pt_regs *regs)
{
unsigned long addr = instruction_pointer(regs);
- mm_segment_t fs;
char str[sizeof("00000000 ") * 5 + 2 + 1], *p = str;
int i;
- /*
- * We need to switch to kernel mode so that we can use __get_user
- * to safely read from kernel space. Note that we now dump the
- * code first, just in case the backtrace kills us.
- */
- fs = get_fs();
- set_fs(KERNEL_DS);
-
for (i = -4; i < 1; i++) {
unsigned int val, bad;
@@ -139,8 +129,18 @@ static void dump_instr(const char *lvl,
}
}
printk("%sCode: %s\n", lvl, str);
+}
- set_fs(fs);
+static void dump_instr(const char *lvl, struct pt_regs *regs)
+{
+ if (!user_mode(regs)) {
+ mm_segment_t fs = get_fs();
+ set_fs(KERNEL_DS);
+ __dump_instr(lvl, regs);
+ set_fs(fs);
+ } else {
+ __dump_instr(lvl, regs);
+ }
}
static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
Patches currently in stable-queue which might be from mark.rutland(a)arm.com are
queue-4.4/arm64-fix-dump_instr-when-pan-and-uao-are-in-use.patch
This is a note to let you know that I've just added the patch titled
ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb Mon Sep 17 00:00:00 2001
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Date: Tue, 7 Nov 2017 11:37:07 +0100
Subject: ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
From: Roberto Sassu <roberto.sassu(a)huawei.com>
commit 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb upstream.
Commit b65a9cfc2c38 ("Untangling ima mess, part 2: deal with counters")
moved the call of ima_file_check() from may_open() to do_filp_open() at a
point where the file descriptor is already opened.
This breaks the assumption made by IMA that file descriptors being closed
belong to files whose access was granted by ima_file_check(). The
consequence is that security.ima and security.evm are updated with good
values, regardless of the current appraisal status.
For example, if a file does not have security.ima, IMA will create it after
opening the file for writing, even if access is denied. Access to the file
will be allowed afterwards.
Avoid this issue by checking the appraisal status before updating
security.ima.
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Signed-off-by: Mimi Zohar <zohar(a)linux.vnet.ibm.com>
Signed-off-by: James Morris <james.l.morris(a)oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
security/integrity/ima/ima_appraise.c | 3 +++
1 file changed, 3 insertions(+)
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -297,6 +297,9 @@ void ima_update_xattr(struct integrity_i
if (iint->flags & IMA_DIGSIG)
return;
+ if (iint->ima_file_status != INTEGRITY_PASS)
+ return;
+
rc = ima_collect_measurement(iint, file, NULL, NULL);
if (rc < 0)
return;
Patches currently in stable-queue which might be from roberto.sassu(a)huawei.com are
queue-4.4/ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
This is a note to let you know that I've just added the patch titled
tpm-dev-common: Reject too short writes
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tpm-dev-common-reject-too-short-writes.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From ee70bc1e7b63ac8023c9ff9475d8741e397316e7 Mon Sep 17 00:00:00 2001
From: Alexander Steffen <Alexander.Steffen(a)infineon.com>
Date: Fri, 8 Sep 2017 17:21:32 +0200
Subject: tpm-dev-common: Reject too short writes
From: Alexander Steffen <Alexander.Steffen(a)infineon.com>
commit ee70bc1e7b63ac8023c9ff9475d8741e397316e7 upstream.
tpm_transmit() does not offer an explicit interface to indicate the number
of valid bytes in the communication buffer. Instead, it relies on the
commandSize field in the TPM header that is encoded within the buffer.
Therefore, ensure that a) enough data has been written to the buffer, so
that the commandSize field is present and b) the commandSize field does not
announce more data than has been written to the buffer.
This should have been fixed with CVE-2011-1161 long ago, but apparently
a correct version of that patch never made it into the kernel.
Signed-off-by: Alexander Steffen <Alexander.Steffen(a)infineon.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/tpm/tpm-dev-common.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/drivers/char/tpm/tpm-dev-common.c
+++ b/drivers/char/tpm/tpm-dev-common.c
@@ -110,6 +110,12 @@ ssize_t tpm_common_write(struct file *fi
return -EFAULT;
}
+ if (in_size < 6 ||
+ in_size < be32_to_cpu(*((__be32 *) (priv->data_buffer + 2)))) {
+ mutex_unlock(&priv->buffer_mutex);
+ return -EINVAL;
+ }
+
/* atomic tpm command send and result receive. We only hold the ops
* lock during this period so that the tpm can be unregistered even if
* the char dev is held open.
Patches currently in stable-queue which might be from Alexander.Steffen(a)infineon.com are
queue-4.14/tpm-dev-common-reject-too-short-writes.patch
This is a note to let you know that I've just added the patch titled
serial: 8250_fintek: Fix finding base_port with activated SuperIO
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
serial-8250_fintek-fix-finding-base_port-with-activated-superio.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From fd97e66c5529046e989a0879c3bb58fddb592c71 Mon Sep 17 00:00:00 2001
From: "Ji-Ze Hong (Peter Hong)" <hpeter(a)gmail.com>
Date: Tue, 17 Oct 2017 14:23:08 +0800
Subject: serial: 8250_fintek: Fix finding base_port with activated SuperIO
From: Ji-Ze Hong (Peter Hong) <hpeter(a)gmail.com>
commit fd97e66c5529046e989a0879c3bb58fddb592c71 upstream.
The SuperIO will be configured at boot time by BIOS, but some BIOS
will not deactivate the SuperIO when the end of configuration. It'll
lead to mismatch for pdata->base_port in probe_setup_port(). So we'll
deactivate all SuperIO before activate special base_port in
fintek_8250_enter_key().
Tested on iBASE MI802.
Tested-by: Ji-Ze Hong (Peter Hong) <hpeter+linux_kernel(a)gmail.com>
Signed-off-by: Ji-Ze Hong (Peter Hong) <hpeter+linux_kernel(a)gmail.com>
Reviewd-by: Alan Cox <alan(a)linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/serial/8250/8250_fintek.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/tty/serial/8250/8250_fintek.c
+++ b/drivers/tty/serial/8250/8250_fintek.c
@@ -118,6 +118,9 @@ static int fintek_8250_enter_key(u16 bas
if (!request_muxed_region(base_port, 2, "8250_fintek"))
return -EBUSY;
+ /* Force to deactive all SuperIO in this base_port */
+ outb(EXIT_KEY, base_port + ADDR_PORT);
+
outb(key, base_port + ADDR_PORT);
outb(key, base_port + ADDR_PORT);
return 0;
Patches currently in stable-queue which might be from hpeter(a)gmail.com are
queue-4.14/serial-8250_fintek-fix-finding-base_port-with-activated-superio.patch
This is a note to let you know that I've just added the patch titled
serial: omap: Fix EFR write on RTS deassertion
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
serial-omap-fix-efr-write-on-rts-deassertion.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 2a71de2f7366fb1aec632116d0549ec56d6a3940 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Sat, 21 Oct 2017 10:50:18 +0200
Subject: serial: omap: Fix EFR write on RTS deassertion
From: Lukas Wunner <lukas(a)wunner.de>
commit 2a71de2f7366fb1aec632116d0549ec56d6a3940 upstream.
Commit 348f9bb31c56 ("serial: omap: Fix RTS handling") sought to enable
auto RTS upon manual RTS assertion and disable it on deassertion.
However it seems the latter was done incorrectly, it clears all bits in
the Extended Features Register *except* auto RTS.
Fixes: 348f9bb31c56 ("serial: omap: Fix RTS handling")
Cc: Peter Hurley <peter(a)hurleysoftware.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/serial/omap-serial.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/tty/serial/omap-serial.c
+++ b/drivers/tty/serial/omap-serial.c
@@ -693,7 +693,7 @@ static void serial_omap_set_mctrl(struct
if ((mctrl & TIOCM_RTS) && (port->status & UPSTAT_AUTORTS))
up->efr |= UART_EFR_RTS;
else
- up->efr &= UART_EFR_RTS;
+ up->efr &= ~UART_EFR_RTS;
serial_out(up, UART_EFR, up->efr);
serial_out(up, UART_LCR, lcr);
Patches currently in stable-queue which might be from lukas(a)wunner.de are
queue-4.14/serial-omap-fix-efr-write-on-rts-deassertion.patch
This is a note to let you know that I've just added the patch titled
rcu: Fix up pending cbs check in rcu_prepare_for_idle
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
rcu-fix-up-pending-cbs-check-in-rcu_prepare_for_idle.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 135bd1a230bb69a68c9808a7d25467318900b80a Mon Sep 17 00:00:00 2001
From: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Date: Mon, 7 Aug 2017 11:20:10 +0530
Subject: rcu: Fix up pending cbs check in rcu_prepare_for_idle
From: Neeraj Upadhyay <neeraju(a)codeaurora.org>
commit 135bd1a230bb69a68c9808a7d25467318900b80a upstream.
The pending-callbacks check in rcu_prepare_for_idle() is backwards.
It should accelerate if there are pending callbacks, but the check
rather uselessly accelerates only if there are no callbacks. This commit
therefore inverts this check.
Fixes: 15fecf89e46a ("srcu: Abstract multi-tail callback list handling")
Signed-off-by: Neeraj Upadhyay <neeraju(a)codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck(a)linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1507,7 +1507,7 @@ static void rcu_prepare_for_idle(void)
rdtp->last_accelerate = jiffies;
for_each_rcu_flavor(rsp) {
rdp = this_cpu_ptr(rsp->rda);
- if (rcu_segcblist_pend_cbs(&rdp->cblist))
+ if (!rcu_segcblist_pend_cbs(&rdp->cblist))
continue;
rnp = rdp->mynode;
raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
Patches currently in stable-queue which might be from neeraju(a)codeaurora.org are
queue-4.14/rcu-fix-up-pending-cbs-check-in-rcu_prepare_for_idle.patch
This is a note to let you know that I've just added the patch titled
ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb Mon Sep 17 00:00:00 2001
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Date: Tue, 7 Nov 2017 11:37:07 +0100
Subject: ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
From: Roberto Sassu <roberto.sassu(a)huawei.com>
commit 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb upstream.
Commit b65a9cfc2c38 ("Untangling ima mess, part 2: deal with counters")
moved the call of ima_file_check() from may_open() to do_filp_open() at a
point where the file descriptor is already opened.
This breaks the assumption made by IMA that file descriptors being closed
belong to files whose access was granted by ima_file_check(). The
consequence is that security.ima and security.evm are updated with good
values, regardless of the current appraisal status.
For example, if a file does not have security.ima, IMA will create it after
opening the file for writing, even if access is denied. Access to the file
will be allowed afterwards.
Avoid this issue by checking the appraisal status before updating
security.ima.
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Signed-off-by: Mimi Zohar <zohar(a)linux.vnet.ibm.com>
Signed-off-by: James Morris <james.l.morris(a)oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
security/integrity/ima/ima_appraise.c | 3 +++
1 file changed, 3 insertions(+)
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -320,6 +320,9 @@ void ima_update_xattr(struct integrity_i
if (iint->flags & IMA_DIGSIG)
return;
+ if (iint->ima_file_status != INTEGRITY_PASS)
+ return;
+
rc = ima_collect_measurement(iint, file, NULL, 0, ima_hash_algo);
if (rc < 0)
return;
Patches currently in stable-queue which might be from roberto.sassu(a)huawei.com are
queue-4.14/ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
This is a note to let you know that I've just added the patch titled
tpm-dev-common: Reject too short writes
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tpm-dev-common-reject-too-short-writes.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From ee70bc1e7b63ac8023c9ff9475d8741e397316e7 Mon Sep 17 00:00:00 2001
From: Alexander Steffen <Alexander.Steffen(a)infineon.com>
Date: Fri, 8 Sep 2017 17:21:32 +0200
Subject: tpm-dev-common: Reject too short writes
From: Alexander Steffen <Alexander.Steffen(a)infineon.com>
commit ee70bc1e7b63ac8023c9ff9475d8741e397316e7 upstream.
tpm_transmit() does not offer an explicit interface to indicate the number
of valid bytes in the communication buffer. Instead, it relies on the
commandSize field in the TPM header that is encoded within the buffer.
Therefore, ensure that a) enough data has been written to the buffer, so
that the commandSize field is present and b) the commandSize field does not
announce more data than has been written to the buffer.
This should have been fixed with CVE-2011-1161 long ago, but apparently
a correct version of that patch never made it into the kernel.
Signed-off-by: Alexander Steffen <Alexander.Steffen(a)infineon.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen(a)linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/tpm/tpm-dev-common.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/drivers/char/tpm/tpm-dev-common.c
+++ b/drivers/char/tpm/tpm-dev-common.c
@@ -110,6 +110,12 @@ ssize_t tpm_common_write(struct file *fi
return -EFAULT;
}
+ if (in_size < 6 ||
+ in_size < be32_to_cpu(*((__be32 *) (priv->data_buffer + 2)))) {
+ mutex_unlock(&priv->buffer_mutex);
+ return -EINVAL;
+ }
+
/* atomic tpm command send and result receive. We only hold the ops
* lock during this period so that the tpm can be unregistered even if
* the char dev is held open.
Patches currently in stable-queue which might be from Alexander.Steffen(a)infineon.com are
queue-4.13/tpm-dev-common-reject-too-short-writes.patch
This is a note to let you know that I've just added the patch titled
serial: 8250_fintek: Fix finding base_port with activated SuperIO
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
serial-8250_fintek-fix-finding-base_port-with-activated-superio.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From fd97e66c5529046e989a0879c3bb58fddb592c71 Mon Sep 17 00:00:00 2001
From: "Ji-Ze Hong (Peter Hong)" <hpeter(a)gmail.com>
Date: Tue, 17 Oct 2017 14:23:08 +0800
Subject: serial: 8250_fintek: Fix finding base_port with activated SuperIO
From: Ji-Ze Hong (Peter Hong) <hpeter(a)gmail.com>
commit fd97e66c5529046e989a0879c3bb58fddb592c71 upstream.
The SuperIO will be configured at boot time by BIOS, but some BIOS
will not deactivate the SuperIO when the end of configuration. It'll
lead to mismatch for pdata->base_port in probe_setup_port(). So we'll
deactivate all SuperIO before activate special base_port in
fintek_8250_enter_key().
Tested on iBASE MI802.
Tested-by: Ji-Ze Hong (Peter Hong) <hpeter+linux_kernel(a)gmail.com>
Signed-off-by: Ji-Ze Hong (Peter Hong) <hpeter+linux_kernel(a)gmail.com>
Reviewd-by: Alan Cox <alan(a)linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/serial/8250/8250_fintek.c | 3 +++
1 file changed, 3 insertions(+)
--- a/drivers/tty/serial/8250/8250_fintek.c
+++ b/drivers/tty/serial/8250/8250_fintek.c
@@ -118,6 +118,9 @@ static int fintek_8250_enter_key(u16 bas
if (!request_muxed_region(base_port, 2, "8250_fintek"))
return -EBUSY;
+ /* Force to deactive all SuperIO in this base_port */
+ outb(EXIT_KEY, base_port + ADDR_PORT);
+
outb(key, base_port + ADDR_PORT);
outb(key, base_port + ADDR_PORT);
return 0;
Patches currently in stable-queue which might be from hpeter(a)gmail.com are
queue-4.13/serial-8250_fintek-fix-finding-base_port-with-activated-superio.patch
This is a note to let you know that I've just added the patch titled
serial: omap: Fix EFR write on RTS deassertion
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
serial-omap-fix-efr-write-on-rts-deassertion.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 2a71de2f7366fb1aec632116d0549ec56d6a3940 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Sat, 21 Oct 2017 10:50:18 +0200
Subject: serial: omap: Fix EFR write on RTS deassertion
From: Lukas Wunner <lukas(a)wunner.de>
commit 2a71de2f7366fb1aec632116d0549ec56d6a3940 upstream.
Commit 348f9bb31c56 ("serial: omap: Fix RTS handling") sought to enable
auto RTS upon manual RTS assertion and disable it on deassertion.
However it seems the latter was done incorrectly, it clears all bits in
the Extended Features Register *except* auto RTS.
Fixes: 348f9bb31c56 ("serial: omap: Fix RTS handling")
Cc: Peter Hurley <peter(a)hurleysoftware.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/serial/omap-serial.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/tty/serial/omap-serial.c
+++ b/drivers/tty/serial/omap-serial.c
@@ -693,7 +693,7 @@ static void serial_omap_set_mctrl(struct
if ((mctrl & TIOCM_RTS) && (port->status & UPSTAT_AUTORTS))
up->efr |= UART_EFR_RTS;
else
- up->efr &= UART_EFR_RTS;
+ up->efr &= ~UART_EFR_RTS;
serial_out(up, UART_EFR, up->efr);
serial_out(up, UART_LCR, lcr);
Patches currently in stable-queue which might be from lukas(a)wunner.de are
queue-4.13/serial-omap-fix-efr-write-on-rts-deassertion.patch
This is a note to let you know that I've just added the patch titled
ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb Mon Sep 17 00:00:00 2001
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Date: Tue, 7 Nov 2017 11:37:07 +0100
Subject: ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
From: Roberto Sassu <roberto.sassu(a)huawei.com>
commit 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb upstream.
Commit b65a9cfc2c38 ("Untangling ima mess, part 2: deal with counters")
moved the call of ima_file_check() from may_open() to do_filp_open() at a
point where the file descriptor is already opened.
This breaks the assumption made by IMA that file descriptors being closed
belong to files whose access was granted by ima_file_check(). The
consequence is that security.ima and security.evm are updated with good
values, regardless of the current appraisal status.
For example, if a file does not have security.ima, IMA will create it after
opening the file for writing, even if access is denied. Access to the file
will be allowed afterwards.
Avoid this issue by checking the appraisal status before updating
security.ima.
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Signed-off-by: Mimi Zohar <zohar(a)linux.vnet.ibm.com>
Signed-off-by: James Morris <james.l.morris(a)oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
security/integrity/ima/ima_appraise.c | 3 +++
1 file changed, 3 insertions(+)
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -320,6 +320,9 @@ void ima_update_xattr(struct integrity_i
if (iint->flags & IMA_DIGSIG)
return;
+ if (iint->ima_file_status != INTEGRITY_PASS)
+ return;
+
rc = ima_collect_measurement(iint, file, NULL, 0, ima_hash_algo);
if (rc < 0)
return;
Patches currently in stable-queue which might be from roberto.sassu(a)huawei.com are
queue-4.13/ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
This is a note to let you know that I've just added the patch titled
ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb Mon Sep 17 00:00:00 2001
From: Roberto Sassu <roberto.sassu(a)huawei.com>
Date: Tue, 7 Nov 2017 11:37:07 +0100
Subject: ima: do not update security.ima if appraisal status is not INTEGRITY_PASS
From: Roberto Sassu <roberto.sassu(a)huawei.com>
commit 020aae3ee58c1af0e7ffc4e2cc9fe4dc630338cb upstream.
Commit b65a9cfc2c38 ("Untangling ima mess, part 2: deal with counters")
moved the call of ima_file_check() from may_open() to do_filp_open() at a
point where the file descriptor is already opened.
This breaks the assumption made by IMA that file descriptors being closed
belong to files whose access was granted by ima_file_check(). The
consequence is that security.ima and security.evm are updated with good
values, regardless of the current appraisal status.
For example, if a file does not have security.ima, IMA will create it after
opening the file for writing, even if access is denied. Access to the file
will be allowed afterwards.
Avoid this issue by checking the appraisal status before updating
security.ima.
Signed-off-by: Roberto Sassu <roberto.sassu(a)huawei.com>
Signed-off-by: Mimi Zohar <zohar(a)linux.vnet.ibm.com>
Signed-off-by: James Morris <james.l.morris(a)oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
security/integrity/ima/ima_appraise.c | 3 +++
1 file changed, 3 insertions(+)
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -297,6 +297,9 @@ void ima_update_xattr(struct integrity_i
if (iint->flags & IMA_DIGSIG)
return;
+ if (iint->ima_file_status != INTEGRITY_PASS)
+ return;
+
rc = ima_collect_measurement(iint, file, NULL, NULL);
if (rc < 0)
return;
Patches currently in stable-queue which might be from roberto.sassu(a)huawei.com are
queue-3.18/ima-do-not-update-security.ima-if-appraisal-status-is-not-integrity_pass.patch
On Mon, 2017-11-06 at 10:44 +0100, Greg Kroah-Hartman wrote:
> 4.4-stable review patch. If anyone has any objections, please let me know.
>
> ------------------
>
> From: Mark Rutland <mark.rutland(a)arm.com>
>
> commit 7a7003b1da010d2b0d1dc8bf21c10f5c73b389f1 upstream.
>
> It's possible for a user to deliberately trigger __dump_instr with a
> chosen kernel address.
>
> Let's avoid problems resulting from this by using get_user() rather than
> __get_user(), ensuring that we don't erroneously access kernel memory.
>
> Where we use __dump_instr() on kernel text, we already switch to
> KERNEL_DS, so this shouldn't adversely affect those cases.
>
> Fixes: 60ffc30d5652810d ("arm64: Exception handling")
[...]
This seems harmless, but I don't think it will fix the bug in 4.4
unless you also cherry-pick:
commit c5cea06be060f38e5400d796e61cfc8c36e52924
Author: Mark Rutland <mark.rutland(a)arm.com>
Date: Mon Jun 13 11:15:14 2016 +0100
arm64: fix dump_instr when PAN and UAO are in use
Ben.
--
Ben Hutchings
Software Developer, Codethink Ltd.
Hi---
Previously sent to stable was this patch, and it's made it into
Linus's tree for 4.16. It's a rather urgent data corruption issue
that affects at minimum bcache but possibly other block subsystems.
Several users have lost data. The issue was introduced in the 4.15
branch by 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer
and partitions index").
I'd really appreciate anything that can be done to get this into a
stable 4.15 release soon.
Thanks--
Mike
On Tue, Nov 21, 2017 at 06:34:54PM +0200, Leon Romanovsky wrote:
> On Tue, Nov 21, 2017 at 09:04:42AM -0700, Jason Gunthorpe wrote:
> > On Tue, Nov 21, 2017 at 09:37:27AM -0600, Daniel Jurgens wrote:
> >
> > > The only warning that would make sense is if the mixed ports aren't
> > > all IB or RoCE. As you note, CX-3 can mix those two, we don't want to
> > > see warnings about that.
> >
> > I would really like to see cx3 be changed to not do that, then we
> > could finalize this issue upstream: All device ports must be the same
> > protocol.
>
> I don't see the point of such artificial limitation, the users who
> brought CX-3 have option to work in mixed mode and IMHO it is not right
> to deprecate such ability just because it is hard for us to code for it.
I don't really think it is really too user visible.. Only the device
and port number change, but only if running in mixed mode.
It is not just 'hard for us' it is impossible to reconcile the
differences between ports when enforcing device level things.
This keeps coming up again and again..
Jason
On 21/11/2017 15:22, Leon Romanovsky wrote:
> On Tue, Nov 21, 2017 at 12:44:10PM +0200, Mark Bloch wrote:
>> Hi,
>>
>> On 21/11/2017 12:26, Leon Romanovsky wrote:
>>> From: Daniel Jurgens <danielj(a)mellanox.com>
>>>
>>> For now the only LSM security enforcement mechanism available is
>>> specific to InfiniBand. Bypass enforcement for non-IB link types.
>>> This fixes a regression where modify_qp fails for iWARP because
>>> querying the PKEY returns -EINVAL.
>>>
>>> Cc: Paul Moore <paul(a)paul-moore.com>
>>> Cc: Don Dutile <ddutile(a)redhat.com>
>>> Cc: stable(a)vger.kernel.org
>>> Reported-by: Potnuri Bharat Teja <bharat(a)chelsio.com>
>>> Fixes: d291f1a65232("IB/core: Enforce PKey security on QPs")
>>> Fixes: 47a2b338fe63("IB/core: Enforce security on management datagrams")
>>> Signed-off-by: Daniel Jurgens <danielj(a)mellanox.com>
>>> Reviewed-by: Parav Pandit <parav(a)mellanox.com>
>>> Tested-by: Potnuri Bharat Teja <bharat(a)chelsio.com>
>>> Signed-off-by: Leon Romanovsky <leon(a)kernel.org>
>>> ---
>>> drivers/infiniband/core/security.c | 9 +++++++++
>>> 1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/infiniband/core/security.c b/drivers/infiniband/core/security.c
>>> index 23278ed5be45..314bf1137c7b 100644
>>> --- a/drivers/infiniband/core/security.c
>>> +++ b/drivers/infiniband/core/security.c
>>> @@ -417,8 +417,17 @@ void ib_close_shared_qp_security(struct ib_qp_security *sec)
>>>
>>> int ib_create_qp_security(struct ib_qp *qp, struct ib_device *dev)
>>> {
>>> + u8 i = rdma_start_port(dev);
>>> + bool is_ib = false;
>>> int ret;
>>>
>>> + while (i <= rdma_end_port(dev) && !is_ib)
>>> + is_ib = rdma_protocol_ib(dev, i++);
>>> +
>>
>> What happens if we have mixed port types?
>
> We will have is_ib set and qp_sec will be allocated on device layer and
> not on port level, but because pkeys are IB specific term (at least, I
> didn't find any mentioning in RoCE spec), the modify_qp won't query
> for PKEYS.
>
What is port level for qp_sec?
I just gave mlx4 as an example, but I was talking more about the ability of the RDMA
subsystem to support mixed port types, so in this case, if in the future a vendor
will come with an ib_device with 2 ports, one is IB and one is iWARP bad things will happen.
>> I believe mlx4 can expose two ports where each port uses a different ll protocol.
>> Was that changed?
>
> It is still true.
>
>>
>>> + /* If this isn't an IB device don't create the security context */
>>> + if (!is_ib)
>>> + return 0;
>>> +
>>> qp->qp_sec = kzalloc(sizeof(*qp->qp_sec), GFP_KERNEL);
>>> if (!qp->qp_sec)
>>> return -ENOMEM;
>>>
>>
>> Mark.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo(a)vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
Mark
From: Yazen Ghannam <yazen.ghannam(a)amd.com>
[Upstream commit d65dfc81bb3894fdb68cbc74bbf5fb48d2354071]
The AMD severity grading function was introduced in kernel 4.1. The
current logic can possibly give MCE_AR_SEVERITY for uncorrectable
errors in kernel context. The system may then get stuck in a loop as
memory_failure() will try to handle the bad kernel memory and find it
busy.
Return MCE_PANIC_SEVERITY for all UC errors IN_KERNEL context on AMD
systems.
After:
b2f9d678e28c ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries")
was accepted in v4.6, this issue was masked because of the tail-end attempt
at kernel mode recovery in the #MC handler.
However, uncorrectable errors IN_KERNEL context should always be considered
unrecoverable and cause a panic.
Signed-off-by: Yazen Ghannam <yazen.ghannam(a)amd.com>
Cc: <stable(a)vger.kernel.org> # 4.1.x, 4.4.x
Fixes: bf80bbd7dcf5 (x86/mce: Add an AMD severities-grading function)
---
Same as
d65dfc81bb38 ("x86/MCE/AMD: Always give panic severity for UC errors in kernel context")
but fixed up to apply to v4.1 and v4.4 stable branches.
arch/x86/kernel/cpu/mcheck/mce-severity.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 9c682c222071..a9287f0f06f2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -200,6 +200,9 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
if (m->status & MCI_STATUS_UC) {
+ if (ctx == IN_KERNEL)
+ return MCE_PANIC_SEVERITY;
+
/*
* On older systems where overflow_recov flag is not present, we
* should simply panic if an error overflow occurs. If
@@ -207,10 +210,6 @@ static int mce_severity_amd(struct mce *m, int tolerant, char **msg, bool is_exc
* to at least kill process to prolong system operation.
*/
if (mce_flags.overflow_recov) {
- /* software can try to contain */
- if (!(m->mcgstatus & MCG_STATUS_RIPV) && (ctx == IN_KERNEL))
- return MCE_PANIC_SEVERITY;
-
/* kill current process */
return MCE_AR_SEVERITY;
} else {
--
2.14.1
This is a note to let you know that I've just added the patch titled
vlan: fix a use-after-free in vlan_device_event()
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
vlan-fix-a-use-after-free-in-vlan_device_event.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Cong Wang <xiyou.wangcong(a)gmail.com>
Date: Thu, 9 Nov 2017 16:43:13 -0800
Subject: vlan: fix a use-after-free in vlan_device_event()
From: Cong Wang <xiyou.wangcong(a)gmail.com>
[ Upstream commit 052d41c01b3a2e3371d66de569717353af489d63 ]
After refcnt reaches zero, vlan_vid_del() could free
dev->vlan_info via RCU:
RCU_INIT_POINTER(dev->vlan_info, NULL);
call_rcu(&vlan_info->rcu, vlan_info_rcu_free);
However, the pointer 'grp' still points to that memory
since it is set before vlan_vid_del():
vlan_info = rtnl_dereference(dev->vlan_info);
if (!vlan_info)
goto out;
grp = &vlan_info->grp;
Depends on when that RCU callback is scheduled, we could
trigger a use-after-free in vlan_group_for_each_dev()
right following this vlan_vid_del().
Fix it by moving vlan_vid_del() before setting grp. This
is also symmetric to the vlan_vid_add() we call in
vlan_device_event().
Reported-by: Fengguang Wu <fengguang.wu(a)intel.com>
Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct")
Cc: Alexander Duyck <alexander.duyck(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Girish Moodalbail <girish.moodalbail(a)oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com>
Reviewed-by: Girish Moodalbail <girish.moodalbail(a)oracle.com>
Tested-by: Fengguang Wu <fengguang.wu(a)intel.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/8021q/vlan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -376,6 +376,9 @@ static int vlan_device_event(struct noti
dev->name);
vlan_vid_add(dev, htons(ETH_P_8021Q), 0);
}
+ if (event == NETDEV_DOWN &&
+ (dev->features & NETIF_F_HW_VLAN_CTAG_FILTER))
+ vlan_vid_del(dev, htons(ETH_P_8021Q), 0);
vlan_info = rtnl_dereference(dev->vlan_info);
if (!vlan_info)
@@ -420,9 +423,6 @@ static int vlan_device_event(struct noti
break;
case NETDEV_DOWN:
- if (dev->features & NETIF_F_HW_VLAN_CTAG_FILTER)
- vlan_vid_del(dev, htons(ETH_P_8021Q), 0);
-
/* Put all VLANs for this dev in the down state too. */
vlan_group_for_each_dev(grp, i, vlandev) {
flgs = vlandev->flags;
Patches currently in stable-queue which might be from xiyou.wangcong(a)gmail.com are
queue-3.18/ipv6-dccp-do-not-inherit-ipv6_mc_list-from-parent.patch
queue-3.18/vlan-fix-a-use-after-free-in-vlan_device_event.patch
This is a note to let you know that I've just added the patch titled
tcp: do not mangle skb->cb[] in tcp_make_synack()
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tcp-do-not-mangle-skb-cb-in-tcp_make_synack.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Eric Dumazet <edumazet(a)google.com>
Date: Thu, 2 Nov 2017 12:30:25 -0700
Subject: tcp: do not mangle skb->cb[] in tcp_make_synack()
From: Eric Dumazet <edumazet(a)google.com>
[ Upstream commit 3b11775033dc87c3d161996c54507b15ba26414a ]
Christoph Paasch sent a patch to address the following issue :
tcp_make_synack() is leaving some TCP private info in skb->cb[],
then send the packet by other means than tcp_transmit_skb()
tcp_transmit_skb() makes sure to clear skb->cb[] to not confuse
IPv4/IPV6 stacks, but we have no such cleanup for SYNACK.
tcp_make_synack() should not use tcp_init_nondata_skb() :
tcp_init_nondata_skb() really should be limited to skbs put in write/rtx
queues (the ones that are only sent via tcp_transmit_skb())
This patch fixes the issue and should even save few cpu cycles ;)
Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Reported-by: Christoph Paasch <cpaasch(a)apple.com>
Reviewed-by: Christoph Paasch <cpaasch(a)apple.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv4/tcp_output.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2911,13 +2911,8 @@ struct sk_buff *tcp_make_synack(struct s
tcp_ecn_make_synack(req, th, sk);
th->source = htons(ireq->ir_num);
th->dest = ireq->ir_rmt_port;
- /* Setting of flags are superfluous here for callers (and ECE is
- * not even correctly set)
- */
- tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
- TCPHDR_SYN | TCPHDR_ACK);
-
- th->seq = htonl(TCP_SKB_CB(skb)->seq);
+ skb->ip_summed = CHECKSUM_PARTIAL;
+ th->seq = htonl(tcp_rsk(req)->snt_isn);
/* XXX data is queued and acked as is. No buffer/window check */
th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);
Patches currently in stable-queue which might be from edumazet(a)google.com are
queue-3.18/tcp-do-not-mangle-skb-cb-in-tcp_make_synack.patch
queue-3.18/ipv6-dccp-do-not-inherit-ipv6_mc_list-from-parent.patch
This is a note to let you know that I've just added the patch titled
sctp: do not peel off an assoc from one netns to another one
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
sctp-do-not-peel-off-an-assoc-from-one-netns-to-another-one.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Xin Long <lucien.xin(a)gmail.com>
Date: Tue, 17 Oct 2017 23:26:10 +0800
Subject: sctp: do not peel off an assoc from one netns to another one
From: Xin Long <lucien.xin(a)gmail.com>
[ Upstream commit df80cd9b28b9ebaa284a41df611dbf3a2d05ca74 ]
Now when peeling off an association to the sock in another netns, all
transports in this assoc are not to be rehashed and keep use the old
key in hashtable.
As a transport uses sk->net as the hash key to insert into hashtable,
it would miss removing these transports from hashtable due to the new
netns when closing the sock and all transports are being freeed, then
later an use-after-free issue could be caused when looking up an asoc
and dereferencing those transports.
This is a very old issue since very beginning, ChunYu found it with
syzkaller fuzz testing with this series:
socket$inet6_sctp()
bind$inet6()
sendto$inet6()
unshare(0x40000000)
getsockopt$inet_sctp6_SCTP_GET_ASSOC_ID_LIST()
getsockopt$inet_sctp6_SCTP_SOCKOPT_PEELOFF()
This patch is to block this call when peeling one assoc off from one
netns to another one, so that the netns of all transport would not
go out-sync with the key in hashtable.
Note that this patch didn't fix it by rehashing transports, as it's
difficult to handle the situation when the tuple is already in use
in the new netns. Besides, no one would like to peel off one assoc
to another netns, considering ipaddrs, ifaces, etc. are usually
different.
Reported-by: ChunYu Wang <chunwang(a)redhat.com>
Signed-off-by: Xin Long <lucien.xin(a)gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner(a)gmail.com>
Acked-by: Neil Horman <nhorman(a)tuxdriver.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/sctp/socket.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4466,6 +4466,10 @@ int sctp_do_peeloff(struct sock *sk, sct
struct socket *sock;
int err = 0;
+ /* Do not peel off from one netns to another one. */
+ if (!net_eq(current->nsproxy->net_ns, sock_net(sk)))
+ return -EINVAL;
+
if (!asoc)
return -EINVAL;
Patches currently in stable-queue which might be from lucien.xin(a)gmail.com are
queue-3.18/sctp-do-not-peel-off-an-assoc-from-one-netns-to-another-one.patch
This is a note to let you know that I've just added the patch titled
netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
netfilter-ipvs-clear-ipvs_property-flag-when-skb-net-namespace-changed.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Ye Yin <hustcat(a)gmail.com>
Date: Thu, 26 Oct 2017 16:57:05 +0800
Subject: netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
From: Ye Yin <hustcat(a)gmail.com>
[ Upstream commit 2b5ec1a5f9738ee7bf8f5ec0526e75e00362c48f ]
When run ipvs in two different network namespace at the same host, and one
ipvs transport network traffic to the other network namespace ipvs.
'ipvs_property' flag will make the second ipvs take no effect. So we should
clear 'ipvs_property' when SKB network namespace changed.
Fixes: 621e84d6f373 ("dev: introduce skb_scrub_packet()")
Signed-off-by: Ye Yin <hustcat(a)gmail.com>
Signed-off-by: Wei Zhou <chouryzhou(a)gmail.com>
Signed-off-by: Julian Anastasov <ja(a)ssi.bg>
Signed-off-by: Simon Horman <horms(a)verge.net.au>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
include/linux/skbuff.h | 7 +++++++
net/core/skbuff.c | 1 +
2 files changed, 8 insertions(+)
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3117,6 +3117,13 @@ static inline void nf_reset_trace(struct
#endif
}
+static inline void ipvs_reset(struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_IP_VS)
+ skb->ipvs_property = 0;
+#endif
+}
+
/* Note: This doesn't put any conntrack and bridge info in dst. */
static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
bool copy)
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4069,6 +4069,7 @@ void skb_scrub_packet(struct sk_buff *sk
if (!xnet)
return;
+ ipvs_reset(skb);
skb_orphan(skb);
skb->mark = 0;
}
Patches currently in stable-queue which might be from hustcat(a)gmail.com are
queue-3.18/netfilter-ipvs-clear-ipvs_property-flag-when-skb-net-namespace-changed.patch
This is a note to let you know that I've just added the patch titled
net/sctp: Always set scope_id in sctp_inet6_skb_msgname
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
net-sctp-always-set-scope_id-in-sctp_inet6_skb_msgname.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:28:16 CET 2017
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
Date: Wed, 15 Nov 2017 22:17:48 -0600
Subject: net/sctp: Always set scope_id in sctp_inet6_skb_msgname
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit 7c8a61d9ee1df0fb4747879fa67a99614eb62fec ]
Alexandar Potapenko while testing the kernel with KMSAN and syzkaller
discovered that in some configurations sctp would leak 4 bytes of
kernel stack.
Working with his reproducer I discovered that those 4 bytes that
are leaked is the scope id of an ipv6 address returned by recvmsg.
With a little code inspection and a shrewd guess I discovered that
sctp_inet6_skb_msgname only initializes the scope_id field for link
local ipv6 addresses to the interface index the link local address
pertains to instead of initializing the scope_id field for all ipv6
addresses.
That is almost reasonable as scope_id's are meaniningful only for link
local addresses. Set the scope_id in all other cases to 0 which is
not a valid interface index to make it clear there is nothing useful
in the scope_id field.
There should be no danger of breaking userspace as the stack leak
guaranteed that previously meaningless random data was being returned.
Fixes: 372f525b495c ("SCTP: Resync with LKSCTP tree.")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reported-by: Alexander Potapenko <glider(a)google.com>
Tested-by: Alexander Potapenko <glider(a)google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/sctp/ipv6.c | 2 ++
1 file changed, 2 insertions(+)
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -805,6 +805,8 @@ static void sctp_inet6_skb_msgname(struc
if (ipv6_addr_type(&addr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) {
struct sctp_ulpevent *ev = sctp_skb2event(skb);
addr->v6.sin6_scope_id = ev->iif;
+ } else {
+ addr->v6.sin6_scope_id = 0;
}
}
Patches currently in stable-queue which might be from ebiederm(a)xmission.com are
queue-3.18/net-sctp-always-set-scope_id-in-sctp_inet6_skb_msgname.patch
This is a note to let you know that I've just added the patch titled
fealnx: Fix building error on MIPS
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
fealnx-fix-building-error-on-mips.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: Huacai Chen <chenhc(a)lemote.com>
Date: Thu, 16 Nov 2017 11:07:15 +0800
Subject: fealnx: Fix building error on MIPS
From: Huacai Chen <chenhc(a)lemote.com>
[ Upstream commit cc54c1d32e6a4bb3f116721abf900513173e4d02 ]
This patch try to fix the building error on MIPS. The reason is MIPS
has already defined the LONG macro, which conflicts with the LONG enum
in drivers/net/ethernet/fealnx.c.
Signed-off-by: Huacai Chen <chenhc(a)lemote.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/net/ethernet/fealnx.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/drivers/net/ethernet/fealnx.c
+++ b/drivers/net/ethernet/fealnx.c
@@ -257,8 +257,8 @@ enum rx_desc_status_bits {
RXFSD = 0x00000800, /* first descriptor */
RXLSD = 0x00000400, /* last descriptor */
ErrorSummary = 0x80, /* error summary */
- RUNT = 0x40, /* runt packet received */
- LONG = 0x20, /* long packet received */
+ RUNTPKT = 0x40, /* runt packet received */
+ LONGPKT = 0x20, /* long packet received */
FAE = 0x10, /* frame align error */
CRC = 0x08, /* crc error */
RXER = 0x04, /* receive error */
@@ -1633,7 +1633,7 @@ static int netdev_rx(struct net_device *
dev->name, rx_status);
dev->stats.rx_errors++; /* end of a packet. */
- if (rx_status & (LONG | RUNT))
+ if (rx_status & (LONGPKT | RUNTPKT))
dev->stats.rx_length_errors++;
if (rx_status & RXER)
dev->stats.rx_frame_errors++;
Patches currently in stable-queue which might be from chenhc(a)lemote.com are
queue-3.18/fealnx-fix-building-error-on-mips.patch
This is a note to let you know that I've just added the patch titled
af_netlink: ensure that NLMSG_DONE never fails in dumps
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
af_netlink-ensure-that-nlmsg_done-never-fails-in-dumps.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:37:44 CET 2017
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 9 Nov 2017 13:04:44 +0900
Subject: af_netlink: ensure that NLMSG_DONE never fails in dumps
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
[ Upstream commit 0642840b8bb008528dbdf929cec9f65ac4231ad0 ]
The way people generally use netlink_dump is that they fill in the skb
as much as possible, breaking when nla_put returns an error. Then, they
get called again and start filling out the next skb, and again, and so
forth. The mechanism at work here is the ability for the iterative
dumping function to detect when the skb is filled up and not fill it
past the brim, waiting for a fresh skb for the rest of the data.
However, if the attributes are small and nicely packed, it is possible
that a dump callback function successfully fills in attributes until the
skb is of size 4080 (libmnl's default page-sized receive buffer size).
The dump function completes, satisfied, and then, if it happens to be
that this is actually the last skb, and no further ones are to be sent,
then netlink_dump will add on the NLMSG_DONE part:
nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
It is very important that netlink_dump does this, of course. However, in
this example, that call to nlmsg_put_answer will fail, because the
previous filling by the dump function did not leave it enough room. And
how could it possibly have done so? All of the nla_put variety of
functions simply check to see if the skb has enough tailroom,
independent of the context it is in.
In order to keep the important assumptions of all netlink dump users, it
is therefore important to give them an skb that has this end part of the
tail already reserved, so that the call to nlmsg_put_answer does not
fail. Otherwise, library authors are forced to find some bizarre sized
receive buffer that has a large modulo relative to the common sizes of
messages received, which is ugly and buggy.
This patch thus saves the NLMSG_DONE for an additional message, for the
case that things are dangerously close to the brim. This requires
keeping track of the errno from ->dump() across calls.
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/netlink/af_netlink.c | 17 +++++++++++------
net/netlink/af_netlink.h | 1 +
2 files changed, 12 insertions(+), 6 deletions(-)
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1927,7 +1927,7 @@ static int netlink_dump(struct sock *sk)
struct sk_buff *skb = NULL;
struct nlmsghdr *nlh;
struct module *module;
- int len, err = -ENOBUFS;
+ int err = -ENOBUFS;
int alloc_size;
mutex_lock(nlk->cb_mutex);
@@ -1965,9 +1965,11 @@ static int netlink_dump(struct sock *sk)
goto errout_skb;
netlink_skb_set_owner_r(skb, sk);
- len = cb->dump(skb, cb);
+ if (nlk->dump_done_errno > 0)
+ nlk->dump_done_errno = cb->dump(skb, cb);
- if (len > 0) {
+ if (nlk->dump_done_errno > 0 ||
+ skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
mutex_unlock(nlk->cb_mutex);
if (sk_filter(sk, skb))
@@ -1977,13 +1979,15 @@ static int netlink_dump(struct sock *sk)
return 0;
}
- nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
- if (!nlh)
+ nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE,
+ sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+ if (WARN_ON(!nlh))
goto errout_skb;
nl_dump_check_consistent(cb, nlh);
- memcpy(nlmsg_data(nlh), &len, sizeof(len));
+ memcpy(nlmsg_data(nlh), &nlk->dump_done_errno,
+ sizeof(nlk->dump_done_errno));
if (sk_filter(sk, skb))
kfree_skb(skb);
@@ -2048,6 +2052,7 @@ int __netlink_dump_start(struct sock *ss
cb->skb = skb;
nlk->cb_running = true;
+ nlk->dump_done_errno = INT_MAX;
mutex_unlock(nlk->cb_mutex);
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -35,6 +35,7 @@ struct netlink_sock {
size_t max_recvmsg_len;
wait_queue_head_t wait;
bool cb_running;
+ int dump_done_errno;
struct netlink_callback cb;
struct mutex *cb_mutex;
struct mutex cb_def_mutex;
Patches currently in stable-queue which might be from Jason(a)zx2c4.com are
queue-3.18/af_netlink-ensure-that-nlmsg_done-never-fails-in-dumps.patch
This is a note to let you know that I've just added the patch titled
vlan: fix a use-after-free in vlan_device_event()
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
vlan-fix-a-use-after-free-in-vlan_device_event.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:28:16 CET 2017
From: Cong Wang <xiyou.wangcong(a)gmail.com>
Date: Thu, 9 Nov 2017 16:43:13 -0800
Subject: vlan: fix a use-after-free in vlan_device_event()
From: Cong Wang <xiyou.wangcong(a)gmail.com>
[ Upstream commit 052d41c01b3a2e3371d66de569717353af489d63 ]
After refcnt reaches zero, vlan_vid_del() could free
dev->vlan_info via RCU:
RCU_INIT_POINTER(dev->vlan_info, NULL);
call_rcu(&vlan_info->rcu, vlan_info_rcu_free);
However, the pointer 'grp' still points to that memory
since it is set before vlan_vid_del():
vlan_info = rtnl_dereference(dev->vlan_info);
if (!vlan_info)
goto out;
grp = &vlan_info->grp;
Depends on when that RCU callback is scheduled, we could
trigger a use-after-free in vlan_group_for_each_dev()
right following this vlan_vid_del().
Fix it by moving vlan_vid_del() before setting grp. This
is also symmetric to the vlan_vid_add() we call in
vlan_device_event().
Reported-by: Fengguang Wu <fengguang.wu(a)intel.com>
Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct")
Cc: Alexander Duyck <alexander.duyck(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Girish Moodalbail <girish.moodalbail(a)oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong(a)gmail.com>
Reviewed-by: Girish Moodalbail <girish.moodalbail(a)oracle.com>
Tested-by: Fengguang Wu <fengguang.wu(a)intel.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/8021q/vlan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -376,6 +376,9 @@ static int vlan_device_event(struct noti
dev->name);
vlan_vid_add(dev, htons(ETH_P_8021Q), 0);
}
+ if (event == NETDEV_DOWN &&
+ (dev->features & NETIF_F_HW_VLAN_CTAG_FILTER))
+ vlan_vid_del(dev, htons(ETH_P_8021Q), 0);
vlan_info = rtnl_dereference(dev->vlan_info);
if (!vlan_info)
@@ -423,9 +426,6 @@ static int vlan_device_event(struct noti
struct net_device *tmp;
LIST_HEAD(close_list);
- if (dev->features & NETIF_F_HW_VLAN_CTAG_FILTER)
- vlan_vid_del(dev, htons(ETH_P_8021Q), 0);
-
/* Put all VLANs for this dev in the down state too. */
vlan_group_for_each_dev(grp, i, vlandev) {
flgs = vlandev->flags;
Patches currently in stable-queue which might be from xiyou.wangcong(a)gmail.com are
queue-4.4/vlan-fix-a-use-after-free-in-vlan_device_event.patch
This is a note to let you know that I've just added the patch titled
tcp: do not mangle skb->cb[] in tcp_make_synack()
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tcp-do-not-mangle-skb-cb-in-tcp_make_synack.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Nov 21 15:28:16 CET 2017
From: Eric Dumazet <edumazet(a)google.com>
Date: Thu, 2 Nov 2017 12:30:25 -0700
Subject: tcp: do not mangle skb->cb[] in tcp_make_synack()
From: Eric Dumazet <edumazet(a)google.com>
[ Upstream commit 3b11775033dc87c3d161996c54507b15ba26414a ]
Christoph Paasch sent a patch to address the following issue :
tcp_make_synack() is leaving some TCP private info in skb->cb[],
then send the packet by other means than tcp_transmit_skb()
tcp_transmit_skb() makes sure to clear skb->cb[] to not confuse
IPv4/IPV6 stacks, but we have no such cleanup for SYNACK.
tcp_make_synack() should not use tcp_init_nondata_skb() :
tcp_init_nondata_skb() really should be limited to skbs put in write/rtx
queues (the ones that are only sent via tcp_transmit_skb())
This patch fixes the issue and should even save few cpu cycles ;)
Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Reported-by: Christoph Paasch <cpaasch(a)apple.com>
Reviewed-by: Christoph Paasch <cpaasch(a)apple.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv4/tcp_output.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3018,13 +3018,8 @@ struct sk_buff *tcp_make_synack(const st
tcp_ecn_make_synack(req, th);
th->source = htons(ireq->ir_num);
th->dest = ireq->ir_rmt_port;
- /* Setting of flags are superfluous here for callers (and ECE is
- * not even correctly set)
- */
- tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
- TCPHDR_SYN | TCPHDR_ACK);
-
- th->seq = htonl(TCP_SKB_CB(skb)->seq);
+ skb->ip_summed = CHECKSUM_PARTIAL;
+ th->seq = htonl(tcp_rsk(req)->snt_isn);
/* XXX data is queued and acked as is. No buffer/window check */
th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);
Patches currently in stable-queue which might be from edumazet(a)google.com are
queue-4.4/tcp-do-not-mangle-skb-cb-in-tcp_make_synack.patch