The original upstream fix, commit 55e56f06ed71 ("dax: Don't access a freed
inode"), prompted an immediate cleanup request. Now that the cleanup,
commit d8a706414af4 ("dax: Use non-exclusive wait in
wait_entry_unlocked()"), has landed, backport them both to -stable.
---
Dan Williams (1):
dax: Use non-exclusive wait in wait_entry_unlocked()
Matthew Wilcox (1):
dax: Don't access a freed inode
fs/dax.c | 69 ++++++++++++++++++++++++++++----------------------------------
1 file changed, 31 insertions(+), 38 deletions(-)
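For context, the cleanup switches wait_entry_unlocked() from an exclusive to
a non-exclusive wait: an exclusive waiter is woken alone and implicitly takes
over responsibility for waking the next waiter, which is fragile here because
the woken task must not touch the possibly-freed inode. A minimal sketch of
the non-exclusive pattern (an illustrative kernel-style fragment, not the
upstream diff):

	/* Sleep until the entry is unlocked, without touching the inode
	 * again after dropping the lock (it may be freed while we sleep). */
	DEFINE_WAIT(wait);

	prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE); /* non-exclusive */
	xas_unlock_irq(&xas);
	schedule();
	finish_wait(&wq, &wait);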
On Sun, Jan 6, 2019 at 4:58 AM Sebastian Kemper <sebastian_ml(a)gmx.net> wrote:
> Sorry for probably breaking the thread. This is really in reply to Paul
> Aurich's patch titled "smb3: fix large reads on encrypted connections".
> I browsed linux-cifs and found Paul's patch. With the patch applied the
> problem is gone; kernel 4.19 works like 4.9 did without encryption.
>
> The patch title suggests that only large reads are not working. Well,
> considering the cutoff is between 13K and 17K I'd say that encryption on
> 4.19 for cifs can be considered broken without this patch. It'd be cool
> if this could make it to stable kernels pronto.
Yes - it is a very important patch, and it is merged and marked for stable
(for the 4.19 and 4.20 kernels). Hopefully it can be included soon.
The good news, though, is that with the improved test automation that
Paulo/Aurelien/Ronnie have been doing, we have added more to
the automated xfstests run before checkin and, more importantly,
added various different mount configurations that are much
broader (it used to be only a very small set that ran) and so would have
caught this (and also other regressions that had made it through over the
past couple of years).
Now our next challenge is figuring out how to get automated tests for smb3
to run reasonably regularly against some of the stable kernels, and also
against the full backports of cifs.ko to earlier kernels, which some users
need in order to get the many security, performance and functional
improvements from the past year.
--
Thanks,
Steve
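Since the cutoff Sebastian observed sits between 13K and 17K, a quick way to
check a mount for the regression is a single read larger than that. A minimal
sketch (the mount point and file name are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[17 * 1024];
	int fd = open("/mnt/cifs/testfile", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* On an encrypted (seal) 4.19 mount without the fix, a read this
	 * large fails or comes back short. */
	printf("read %zd bytes\n", read(fd, buf, sizeof(buf)));
	close(fd);
	return 0;
}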
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 18f2c4fcebf2582f96cbd5f2238f4f354a0e4847 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso(a)mit.edu>
Date: Wed, 19 Dec 2018 14:36:58 -0500
Subject: [PATCH] ext4: check for shutdown and r/o file system in
ext4_write_inode()
If the file system has been shut down or is read-only, then
ext4_write_inode() needs to bail out early.
Also use jbd2_complete_transaction() instead of ext4_force_commit() so
we only force a commit if it is needed.
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
Cc: stable(a)kernel.org
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 664b434ba836..9affabd07682 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5400,9 +5400,13 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
{
int err;
- if (WARN_ON_ONCE(current->flags & PF_MEMALLOC))
+ if (WARN_ON_ONCE(current->flags & PF_MEMALLOC) ||
+ sb_rdonly(inode->i_sb))
return 0;
+ if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ return -EIO;
+
if (EXT4_SB(inode->i_sb)->s_journal) {
if (ext4_journal_current_handle()) {
jbd_debug(1, "called recursively, non-PF_MEMALLOC!\n");
@@ -5418,7 +5422,8 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
return 0;
- err = ext4_force_commit(inode->i_sb);
+ err = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
+ EXT4_I(inode)->i_sync_tid);
} else {
struct ext4_iloc iloc;
Hi,
[This is an automated email]
This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 277e4ab7d530 SUNRPC: Simplify TCP receive code by switching to using iterators.
The bot has tested the following trees: v4.20.0,
v4.20.0: Build failed! Errors:
net/sunrpc/xprtsock.c:396:24: error: ‘struct bio_vec’ has no member named ‘page’; did you mean ‘bv_page’?
How should we proceed with this patch?
--
Thanks,
Sasha
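For reference, the v4.20 build failure above is a simple field-name mismatch
in the backport: struct bio_vec members carry a bv_ prefix, roughly as
follows (a sketch of include/linux/bvec.h from that era):

struct bio_vec {
	struct page	*bv_page;	/* the code must use ->bv_page, */
	unsigned int	bv_len;		/* not ->page */
	unsigned int	bv_offset;
};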
On Tue, Dec 11, 2018 at 03:29:31PM -0500, Dietmar May wrote:
>Sasha,
>
>I've verified that 4.9.143 no longer exhibits this problem.
Thanks for confirming!
>The revert hasn't shown up in 4.4 yet; but I'll verify once merged there.
There was no 4.4 release yet; it's coming, I promise :)
--
Thanks,
Sasha
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0568e82dbe2510fc1fa664f58e5c997d3f1e649e Mon Sep 17 00:00:00 2001
From: Josef Bacik <jbacik(a)fb.com>
Date: Fri, 30 Nov 2018 11:52:14 -0500
Subject: [PATCH] btrfs: run delayed items before dropping the snapshot
With my delayed refs patches in place, we started seeing a large number
of aborts in __btrfs_free_extent:
BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
Call Trace:
? btrfs_merge_delayed_refs+0xaf/0x340
__btrfs_run_delayed_refs+0x6ea/0xfc0
? btrfs_set_path_blocking+0x31/0x60
btrfs_run_delayed_refs+0xeb/0x180
btrfs_commit_transaction+0x179/0x7f0
? btrfs_check_space_for_delayed_refs+0x30/0x50
? should_end_transaction.isra.19+0xe/0x40
btrfs_drop_snapshot+0x41c/0x7c0
btrfs_clean_one_deleted_snapshot+0xb5/0xd0
cleaner_kthread+0xf6/0x120
kthread+0xf8/0x130
? btree_invalidatepage+0x90/0x90
? kthread_bind+0x10/0x10
ret_from_fork+0x35/0x40
This was because btrfs_drop_snapshot depends on the root not being
modified while it's dropping the snapshot. It will unlock the root node
(and really every node) as it walks down the tree, only to re-lock it
when it needs to do something. This is a problem because if we modify
the tree we could cow a block in our path, which frees our reference to
that block. Then once we get back to that shared block we'll free our
reference to it again, and get ENOENT when trying to lookup our extent
reference to that block in __btrfs_free_extent.
This is ultimately happening because we have delayed items left to be
processed for our deleted snapshot _after_ all of the inodes are closed
for the snapshot. We only run the delayed inode item if we're deleting
the inode, and even then we do not run the delayed insertions or delayed
removals. These can be run at any point after our final inode does its
last iput, which is what triggers the snapshot deletion. We can end up
with the snapshot deletion happening and then have the delayed items run
on that file system, resulting in the above problem.
This problem has existed forever; however, my patches made it much easier
to hit, as I wake up the cleaner much more often to deal with delayed
iputs, which made us more likely to start the snapshot dropping work
before the transaction commits, which is when the delayed items would
generally be run. Before, generally speaking, we would run the delayed
items, commit the transaction, and wakeup the cleaner thread to start
deleting snapshots, which means we were less likely to hit this problem.
You could still hit it if you had multiple snapshots to be deleted and
ended up with lots of delayed items, but it was definitely harder.
Fix for now by simply running all the delayed items before starting to
drop the snapshot. We could make this smarter in the future by making
the delayed items per-root, and then simply drop any delayed items for
roots that we are going to delete. But for now just a quick and easy
solution is the safest.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de21e0c93eb6..8a9ce33dfdbc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9272,6 +9272,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
goto out_free;
}
+ err = btrfs_run_delayed_items(trans);
+ if (err)
+ goto out_end_trans;
+
if (block_rsv)
trans->block_rsv = block_rsv;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 41bd60676923822de1df2c50b3f9a10171f4338a Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Wed, 28 Nov 2018 14:54:28 +0000
Subject: [PATCH] Btrfs: fix fsync of files with multiple hard links in new
directories
The log tree has a long-standing problem: when a file is fsync'ed, we
only check for new ancestors, created in the current transaction, by
following only the hard link for which the fsync was issued. We follow the
ancestors using the VFS' dget_parent() API. This means that if we create a
new link for a file in a directory that is new (or in an any other new
ancestor directory) and then fsync the file using an old hard link, we end
up not logging the new ancestor, and on log replay that new hard link and
ancestor do not exist. In some cases, involving renames, the file will not
exist at all.
Example:
mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt
mkdir /mnt/A
touch /mnt/foo
ln /mnt/foo /mnt/A/bar
xfs_io -c fsync /mnt/foo
<power failure>
In this example, after log replay only the hard link named 'foo' exists
and directory A does not, which is unexpected. In other major Linux
filesystems, such as ext4, xfs and f2fs for example, both hard links exist
and so does directory A after mounting the filesystem again.
Checking whether any ancestors are new and need to be logged was added in
2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes"),
however only for the ancestors of the hard link (dentry) for which the
fsync was issued, instead of checking all ancestors of all of the
inode's hard links.
So fix this by tracking the id of the last transaction where a hard link
was created for an inode and then on fsync fallback to a full transaction
commit when an inode has more than one hard link and at least one new hard
link was created in the current transaction. This is the simplest solution
since this is not a common use case (adding frequently hard links for
which there's an ancestor created in the current transaction and then
fsync the file). In case it ever becomes a common use case, a solution
that consists of iterating the fs/subvol btree for each hard link and
checking if any ancestor is new could be implemented.
This solves many unexpected scenarios reported by Jayashree Mohan and
Vijay Chidambaram, and for which there is a new test case for fstests
under review.
Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes")
CC: stable(a)vger.kernel.org # 4.4+
Reported-by: Vijay Chidambaram <vvijay03(a)gmail.com>
Reported-by: Jayashree Mohan <jayashree2912(a)gmail.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fc25607304f2..6f5d07415dab 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -147,6 +147,12 @@ struct btrfs_inode {
*/
u64 last_unlink_trans;
+ /*
+ * Track the transaction id of the last transaction used to create a
+ * hard link for the inode. This is used by the log tree (fsync).
+ */
+ u64 last_link_trans;
+
/*
* Number of bytes outstanding that are going to need csums. This is
* used in ENOSPC accounting.
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d54bdef16d8d..b4129d9072ec 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3658,6 +3658,21 @@ static int btrfs_read_locked_inode(struct inode *inode,
* inode is not a directory, logging its parent unnecessarily.
*/
BTRFS_I(inode)->last_unlink_trans = BTRFS_I(inode)->last_trans;
+ /*
+ * Similar reasoning for last_link_trans, needs to be set otherwise
+ * for a case like the following:
+ *
+ * mkdir A
+ * touch foo
+ * ln foo A/bar
+ * echo 2 > /proc/sys/vm/drop_caches
+ * fsync foo
+ * <power failure>
+ *
+ * Would result in link bar and directory A not existing after the power
+ * failure.
+ */
+ BTRFS_I(inode)->last_link_trans = BTRFS_I(inode)->last_trans;
path->slots[0]++;
if (inode->i_nlink != 1 ||
@@ -6597,6 +6612,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
if (err)
goto fail;
}
+ BTRFS_I(inode)->last_link_trans = trans->transid;
d_instantiate(dentry, inode);
ret = btrfs_log_new_name(trans, BTRFS_I(inode), NULL, parent,
true, NULL);
@@ -9123,6 +9139,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
ei->index_cnt = (u64)-1;
ei->dir_index = 0;
ei->last_unlink_trans = 0;
+ ei->last_link_trans = 0;
ei->last_log_commit = 0;
spin_lock_init(&ei->lock);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 013d0abcd46b..5baad9bebc62 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5758,6 +5758,22 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
goto end_trans;
}
+ /*
+ * If a new hard link was added to the inode in the current transaction
+ * and its link count is now greater than 1, we need to fallback to a
+ * transaction commit, otherwise we can end up not logging all its new
+ * parents for all the hard links. Here just from the dentry used to
+ * fsync, we can not visit the ancestor inodes for all the other hard
+ * links to figure out if any is new, so we fallback to a transaction
+ * commit (instead of adding a lot of complexity of scanning a btree,
+ * since this scenario is not a common use case).
+ */
+ if (inode->vfs_inode.i_nlink > 1 &&
+ inode->last_link_trans > last_committed) {
+ ret = -EMLINK;
+ goto end_trans;
+ }
+
while (1) {
if (!parent || d_really_is_negative(parent) || sb != parent->d_sb)
break;
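For completeness, the shell reproducer above expressed as a small C program
(illustrative; assumes a freshly created btrfs filesystem mounted at /mnt):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd;

	mkdir("/mnt/A", 0755);
	fd = open("/mnt/foo", O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	link("/mnt/foo", "/mnt/A/bar");	/* new hard link in a new directory */
	fsync(fd);			/* fsync via the old hard link */
	/* <power failure here>: without the fix, log replay loses A/bar
	 * and directory A. */
	close(fd);
	return 0;
}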
The patch below does not apply to the 4.19-stable or 4.20-stable trees.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9349e23907be1954ccdf6d771d640e2788da1643 Mon Sep 17 00:00:00 2001
From: "Dmitry V. Levin" <ldv(a)altlinux.org>
Date: Thu, 1 Nov 2018 14:03:08 +0300
Subject: [PATCH] uapi: fix linux/kfd_ioctl.h userspace compilation errors
Consistently use types provided by <linux/types.h> via <drm/drm.h>
to fix the following linux/kfd_ioctl.h userspace compilation errors:
/usr/include/linux/kfd_ioctl.h:250:2: error: unknown type name 'uint32_t'
uint32_t reset_type;
/usr/include/linux/kfd_ioctl.h:251:2: error: unknown type name 'uint32_t'
uint32_t reset_cause;
/usr/include/linux/kfd_ioctl.h:252:2: error: unknown type name 'uint32_t'
uint32_t memory_lost;
/usr/include/linux/kfd_ioctl.h:253:2: error: unknown type name 'uint32_t'
uint32_t gpu_id;
Fixes: 0c119abad7f0d ("drm/amd: Add kfd ioctl defines for hw_exception event")
Cc: <stable(a)vger.kernel.org> # v4.19
Signed-off-by: Dmitry V. Levin <ldv(a)altlinux.org>
Reviewed-by: Felix Kuehling <Felix.Kuehling(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index f5ff8a76e208..dae897f38e59 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -255,10 +255,10 @@ struct kfd_hsa_memory_exception_data {
/* hw exception data */
struct kfd_hsa_hw_exception_data {
- uint32_t reset_type;
- uint32_t reset_cause;
- uint32_t memory_lost;
- uint32_t gpu_id;
+ __u32 reset_type;
+ __u32 reset_cause;
+ __u32 memory_lost;
+ __u32 gpu_id;
};
/* Event data */
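A simple userspace check for this fix is a build test that includes nothing
besides the uapi header, so it compiles only if the header is self-contained
through <linux/types.h> (a sketch; the struct comes from the diff above):

/* build with: gcc -c kfd_check.c */
#include <linux/kfd_ioctl.h>

int main(void)
{
	struct kfd_hsa_hw_exception_data d = { 0 };

	/* with the old uint32_t members, this file fails to compile
	 * because nothing here pulls in <stdint.h> */
	return (int)d.reset_type;
}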
The patch below does not apply to the 4.19-stable or 4.20-stable trees.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2cd4bd192ee94848695c1c052d87913260e10f36 Mon Sep 17 00:00:00 2001
From: Ram Pai <linuxram(a)us.ibm.com>
Date: Thu, 20 Dec 2018 12:03:30 -0800
Subject: [PATCH] powerpc/pkeys: Fix handling of pkey state across fork()
Protection key tracking information is not copied over to the
mm_struct of the child during fork(). This can cause the child to
erroneously allocate keys that were already allocated. Any allocated
execute-only key is lost as well.
Add code, called by dup_mmap(), to copy the pkey state from parent to
child explicitly.
This problem was originally found by Dave Hansen on x86, and it turns
out to be a problem on powerpc as well.
Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
Cc: stable(a)vger.kernel.org # v4.16+
Reviewed-by: Thiago Jung Bauermann <bauerman(a)linux.ibm.com>
Signed-off-by: Ram Pai <linuxram(a)us.ibm.com>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index c05efd2e8736..e687ed31d85a 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -217,12 +217,6 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
#endif
}
-static inline int arch_dup_mmap(struct mm_struct *oldmm,
- struct mm_struct *mm)
-{
- return 0;
-}
-
#ifdef CONFIG_PPC_BOOK3E_64
static inline void arch_exit_mmap(struct mm_struct *mm)
{
@@ -247,6 +241,7 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
#ifdef CONFIG_PPC_MEM_KEYS
bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write,
bool execute, bool foreign);
+void arch_dup_pkeys(struct mm_struct *oldmm, struct mm_struct *mm);
#else /* CONFIG_PPC_MEM_KEYS */
static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
bool write, bool execute, bool foreign)
@@ -259,6 +254,7 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
#define thread_pkey_regs_save(thread)
#define thread_pkey_regs_restore(new_thread, old_thread)
#define thread_pkey_regs_init(thread)
+#define arch_dup_pkeys(oldmm, mm)
static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
{
@@ -267,5 +263,12 @@ static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
#endif /* CONFIG_PPC_MEM_KEYS */
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+ struct mm_struct *mm)
+{
+ arch_dup_pkeys(oldmm, mm);
+ return 0;
+}
+
#endif /* __KERNEL__ */
#endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 04b60a8f6e69..587807763737 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -415,3 +415,13 @@ bool arch_vma_access_permitted(struct vm_area_struct *vma, bool write,
return pkey_access_permitted(vma_pkey(vma), write, execute);
}
+
+void arch_dup_pkeys(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+ if (static_branch_likely(&pkey_disabled))
+ return;
+
+ /* Duplicate the oldmm pkey state in mm: */
+ mm_pkey_allocation_map(mm) = mm_pkey_allocation_map(oldmm);
+ mm->context.execute_only_pkey = oldmm->context.execute_only_pkey;
+}
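The user-visible effect can be sketched with a small test (illustrative;
assumes a CPU, kernel and libc with protection-key support, e.g. the glibc
pkey_alloc() wrapper):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int parent_key = pkey_alloc(0, 0);

	if (parent_key < 0) {
		perror("pkey_alloc");
		return 1;
	}
	if (fork() == 0) {
		/* With the bug, the child's allocation map is stale, so it
		 * can be handed a key number the parent already owns. */
		int child_key = pkey_alloc(0, 0);

		printf("parent=%d child=%d%s\n", parent_key, child_key,
		       child_key == parent_key ? " (collision!)" : "");
		_exit(0);
	}
	wait(NULL);
	return 0;
}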
The patch below does not apply to the 3.18, 4.4, 4.9, or 4.14 stable trees.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6f5b9f018f4c7686fd944d920209d1382d320e4e Mon Sep 17 00:00:00 2001
From: Breno Leitao <leitao(a)debian.org>
Date: Mon, 26 Nov 2018 18:12:00 -0200
Subject: [PATCH] powerpc/tm: Unset MSR[TS] if not recheckpointing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
There is a TM Bad Thing bug that can be caused when you return from a
signal context in a suspended transaction but with ucontext MSR[TS] unset.
This forces regs->msr[TS] to be set at syscall entrance (since the CPU
state is transactional). It also calls treclaim() to flush the transaction
state, which is done based on the live (mfmsr) MSR state.
Since the user context MSR[TS] is not set, restore_tm_sigcontexts() is not
called, so recheckpoint is not executed and the CPU state is left
non-transactional. When calling rfid, SRR1 will have MSR[TS] set, but the
CPU state is non-transactional, causing the TM Bad Thing with the following
stack:
[ 33.862316] Bad kernel stack pointer 3fffd9dce3e0 at c00000000000c47c
cpu 0x8: Vector: 700 (Program Check) at [c00000003ff7fd40]
pc: c00000000000c47c: fast_exception_return+0xac/0xb4
lr: 00003fff865f442c
sp: 3fffd9dce3e0
msr: 8000000102a03031
current = 0xc00000041f68b700
paca = 0xc00000000fb84800 softe: 0 irq_happened: 0x01
pid = 1721, comm = tm-signal-sigre
Linux version 4.9.0-3-powerpc64le (debian-kernel(a)lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
WARNING: exception is not recoverable, can't continue
The same problem happens in the 32-bit signal handler, and the fix is very
similar: if tm_recheckpoint() is not executed, then regs->msr[TS] should be
zeroed.
This patch also fixes a sparse warning related to lack of indentation when
CONFIG_PPC_TRANSACTIONAL_MEM is set.
Fixes: 2b0a576d15e0e ("powerpc: Add new transactional memory state to the signal context")
CC: Stable <stable(a)vger.kernel.org> # 3.10+
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Tested-by: Michal Suchánek <msuchanek(a)suse.de>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 7484f43493d3..2d47cc79e5b3 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -1158,11 +1158,11 @@ SYSCALL_DEFINE0(rt_sigreturn)
{
struct rt_sigframe __user *rt_sf;
struct pt_regs *regs = current_pt_regs();
+ int tm_restore = 0;
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
struct ucontext __user *uc_transact;
unsigned long msr_hi;
unsigned long tmp;
- int tm_restore = 0;
#endif
/* Always make any pending restarted system calls return -EINTR */
current->restart_block.fn = do_no_restart_syscall;
@@ -1210,11 +1210,19 @@ SYSCALL_DEFINE0(rt_sigreturn)
goto bad;
}
}
- if (!tm_restore)
- /* Fall through, for non-TM restore */
+ if (!tm_restore) {
+ /*
+ * Unset regs->msr because ucontext MSR TS is not
+ * set, and recheckpoint was not called. This avoid
+ * hitting a TM Bad thing at RFID
+ */
+ regs->msr &= ~MSR_TS_MASK;
+ }
+ /* Fall through, for non-TM restore */
#endif
- if (do_setcontext(&rt_sf->uc, regs, 1))
- goto bad;
+ if (!tm_restore)
+ if (do_setcontext(&rt_sf->uc, regs, 1))
+ goto bad;
/*
* It's not clear whether or why it is desirable to save the
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index ba093ec5a21f..0935fe6c282a 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -757,11 +757,23 @@ SYSCALL_DEFINE0(rt_sigreturn)
&uc_transact->uc_mcontext))
goto badframe;
}
- else
- /* Fall through, for non-TM restore */
#endif
- if (restore_sigcontext(current, NULL, 1, &uc->uc_mcontext))
- goto badframe;
+ /* Fall through, for non-TM restore */
+ if (!MSR_TM_ACTIVE(msr)) {
+ /*
+ * Unset MSR[TS] on the thread regs since MSR from user
+ * context does not have MSR active, and recheckpoint was
+ * not called since restore_tm_sigcontexts() was not called
+ * also.
+ *
+ * If not unsetting it, the code can RFID to userspace with
+ * MSR[TS] set, but without CPU in the proper state,
+ * causing a TM bad thing.
+ */
+ current->thread.regs->msr &= ~MSR_TS_MASK;
+ if (restore_sigcontext(current, NULL, 1, &uc->uc_mcontext))
+ goto badframe;
+ }
if (restore_altstack(&uc->uc_stack))
goto badframe;
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e1c3743e1a20647c53b719dbf28b48f45d23f2cd Mon Sep 17 00:00:00 2001
From: Breno Leitao <leitao(a)debian.org>
Date: Wed, 21 Nov 2018 17:21:09 -0200
Subject: [PATCH] powerpc/tm: Set MSR[TS] just prior to recheckpoint
On a signal handler return, the user could set a context with MSR[TS] bits
set, and these bits would be copied to task regs->msr.
At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
several __get_user() are called and then a recheckpoint is executed.
This is a problem since a page fault (in kernel space) could happen when
calling __get_user(). If it happens, the process MSR[TS] bits were
already set, but recheckpoint was not executed, and SPRs are still invalid.
The page fault can cause the current process to be de-scheduled, with
MSR[TS] active and without tm_recheckpoint() being called. More
importantly, without the TEXASR[FS] bit set either.
Since TEXASR might not have the FS bit set, when the process is
scheduled back it will try to reclaim, which will be aborted because
the CPU is not in the suspended state, and then recheckpoint. This
recheckpoint will restore thread->texasr into the TEXASR SPR, which might
be zero, hitting a BUG_ON().
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
pc: c000000000054550: restore_gprs+0xb0/0x180
lr: 0000000000000000
sp: c00000041f157950
msr: 8000000100021033
current = 0xc00000041f143000
paca = 0xc00000000fb86300 softe: 0 irq_happened: 0x01
pid = 1021, comm = kworker/11:1
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
Linux version 4.9.0-3-powerpc64le (debian-kernel(a)lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
enter ? for help
[c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
[c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
[c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
[c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
[c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
[c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
[c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c
This patch simply delays the MSR[TS] set, so, if there is any page fault in
the __get_user() section, it does not have regs->msr[TS] set, since the TM
structures are still invalid, thus avoiding doing TM operations for
in-kernel exceptions and possible process reschedule.
With this patch, the MSR[TS] will only be set just before recheckpointing
and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
invalid state.
Other than that, if CONFIG_PREEMPT is set, there might be a preemption just
after setting MSR[TS] and before tm_recheckpoint(), so this block must be
atomic from a preemption perspective; hence the preempt_disable()/
preempt_enable() calls around this code.
It is not possible to move tm_recheckpoint() to happen earlier, because it
is required to get the checkpointed registers from userspace with
__get_user(); thus, the only way to avoid this undesired behavior is to
delay setting MSR[TS].
The 32-bit signal handler seems to be safe from this current issue, but it
might be exposed to the preemption issue, so preemption is disabled in that
chunk of code as well.
Changes from v2:
* Run the critical section with preempt_disable.
Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
Cc: stable(a)vger.kernel.org (v3.9+)
Signed-off-by: Breno Leitao <leitao(a)debian.org>
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 9d39e0eb03ff..7484f43493d3 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -848,7 +848,23 @@ static long restore_tm_user_regs(struct pt_regs *regs,
/* If TM bits are set to the reserved value, it's an invalid context */
if (MSR_TM_RESV(msr_hi))
return 1;
- /* Pull in the MSR TM bits from the user context */
+
+ /*
+ * Disabling preemption, since it is unsafe to be preempted
+ * with MSR[TS] set without recheckpointing.
+ */
+ preempt_disable();
+
+ /*
+ * CAUTION:
+ * After regs->MSR[TS] being updated, make sure that get_user(),
+ * put_user() or similar functions are *not* called. These
+ * functions can generate page faults which will cause the process
+ * to be de-scheduled with MSR[TS] set but without calling
+ * tm_recheckpoint(). This can cause a bug.
+ *
+ * Pull in the MSR TM bits from the user context
+ */
regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr_hi & MSR_TS_MASK);
/* Now, recheckpoint. This loads up all of the checkpointed (older)
* registers, including FP and V[S]Rs. After recheckpointing, the
@@ -873,6 +889,8 @@ static long restore_tm_user_regs(struct pt_regs *regs,
}
#endif
+ preempt_enable();
+
return 0;
}
#endif
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index e53ad11be385..ba093ec5a21f 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
if (MSR_TM_RESV(msr))
return -EINVAL;
- /* pull in MSR TS bits from user context */
- regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
-
- /*
- * Ensure that TM is enabled in regs->msr before we leave the signal
- * handler. It could be the case that (a) user disabled the TM bit
- * through the manipulation of the MSR bits in uc_mcontext or (b) the
- * TM bit was disabled because a sufficient number of context switches
- * happened whilst in the signal handler and load_tm overflowed,
- * disabling the TM bit. In either case we can end up with an illegal
- * TM state leading to a TM Bad Thing when we return to userspace.
- */
- regs->msr |= MSR_TM;
-
/* pull in MSR LE from user context */
regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
@@ -572,6 +558,34 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
tm_enable();
/* Make sure the transaction is marked as failed */
tsk->thread.tm_texasr |= TEXASR_FS;
+
+ /*
+ * Disabling preemption, since it is unsafe to be preempted
+ * with MSR[TS] set without recheckpointing.
+ */
+ preempt_disable();
+
+ /* pull in MSR TS bits from user context */
+ regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
+
+ /*
+ * Ensure that TM is enabled in regs->msr before we leave the signal
+ * handler. It could be the case that (a) user disabled the TM bit
+ * through the manipulation of the MSR bits in uc_mcontext or (b) the
+ * TM bit was disabled because a sufficient number of context switches
+ * happened whilst in the signal handler and load_tm overflowed,
+ * disabling the TM bit. In either case we can end up with an illegal
+ * TM state leading to a TM Bad Thing when we return to userspace.
+ *
+ * CAUTION:
+ * After regs->MSR[TS] being updated, make sure that get_user(),
+ * put_user() or similar functions are *not* called. These
+ * functions can generate page faults which will cause the process
+ * to be de-scheduled with MSR[TS] set but without calling
+ * tm_recheckpoint(). This can cause a bug.
+ */
+ regs->msr |= MSR_TM;
+
/* This loads the checkpointed FP/VEC state, if used */
tm_recheckpoint(&tsk->thread);
@@ -585,6 +599,8 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
regs->msr |= MSR_VEC;
}
+ preempt_enable();
+
return err;
}
#endif
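Condensed, the shape of the 64-bit fix is (a sketch assembled from the diff
above, not the complete upstream code):

	/* nothing between here and the recheckpoint may fault or be
	 * preempted */
	preempt_disable();

	/* pull in MSR TS bits from user context */
	regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK);
	/* no get_user()/put_user() past this point: a page fault could
	 * deschedule the task with MSR[TS] set but no recheckpoint done */
	tm_recheckpoint(&tsk->thread);

	preempt_enable();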
The patch below does not apply to the 4.4, 4.9, or 4.14 stable trees.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From be6821f82c3cc36e026f5afd10249988852b35ea Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 11 Dec 2018 10:19:45 +0000
Subject: [PATCH] Btrfs: send, fix race with transaction commits that create
snapshots
If we create a snapshot of a snapshot currently being used by a send
operation, we can end up with send failing unexpectedly (returning
-ENOENT error to user space for example). The following diagram shows
how this happens.
 CPU 1                              CPU 2                            CPU 3

 btrfs_ioctl_send()
  (...)
                                    create_snapshot()
                                     -> creates snapshot of a
                                        root used by the send
                                        task
                                    btrfs_commit_transaction()
                                     create_pending_snapshot()
  __get_inode_info()
   btrfs_search_slot()
    btrfs_search_slot_get_root()
     down_read commit_root_sem
     get reference on eb of the
     commit root
      -> eb with bytenr == X
     up_read commit_root_sem
                                     btrfs_cow_block(root node)
                                      btrfs_free_tree_block()
                                       -> creates delayed ref to
                                          free the extent
                                     btrfs_run_delayed_refs()
                                      -> runs the delayed ref,
                                         adds extent to
                                         fs_info->pinned_extents
                                     btrfs_finish_extent_commit()
                                      unpin_extent_range()
                                       -> marks extent as free
                                          in the free space cache
                                    transaction commit finishes
                                                                     btrfs_start_transaction()
                                                                      (...)
                                                                      btrfs_cow_block()
                                                                       btrfs_alloc_tree_block()
                                                                        btrfs_reserve_extent()
                                                                         -> allocates extent at
                                                                            bytenr == X
                                                                         btrfs_init_new_buffer(bytenr X)
                                                                          btrfs_find_create_tree_block()
                                                                           alloc_extent_buffer(bytenr X)
                                                                            find_extent_buffer(bytenr X)
                                                                             -> returns existing eb,
                                                                                which the send task got
  (...)
                                                                      -> modifies content of the
                                                                         eb with bytenr == X
  -> uses an eb that now
     belongs to some other
     tree and no longer matches
     the commit root of the
     snapshot, results will be
     unpredictable
The consequences of this race vary, and can lead to searches in
the commit root performed by the send task failing unexpectedly (unable to
find inode items, returning -ENOENT to user space, for example) or not
failing because an inode item with the same number was added to the tree
that reused the metadata extent, in which case send can behave incorrectly
in the worst case or just fail later for some reason.
Fix this by performing a copy of the commit root's extent buffer when doing
a search in the context of a send operation.
CC: stable(a)vger.kernel.org # 4.4.x: 1fc28d8e2e9: Btrfs: move get root out of btrfs_search_slot to a helper
CC: stable(a)vger.kernel.org # 4.4.x: f9ddfd0592a: Btrfs: remove unused check of skip_locking
CC: stable(a)vger.kernel.org # 4.4.x
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 72745235896f..4252e89df6ae 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2587,14 +2587,27 @@ static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
root_lock = BTRFS_READ_LOCK;
if (p->search_commit_root) {
- /* The commit roots are read only so we always do read locks */
- if (p->need_commit_sem)
+ /*
+ * The commit roots are read only so we always do read locks,
+ * and we always must hold the commit_root_sem when doing
+ * searches on them, the only exception is send where we don't
+ * want to block transaction commits for a long time, so
+ * we need to clone the commit root in order to avoid races
+ * with transaction commits that create a snapshot of one of
+ * the roots used by a send operation.
+ */
+ if (p->need_commit_sem) {
down_read(&fs_info->commit_root_sem);
- b = root->commit_root;
- extent_buffer_get(b);
- level = btrfs_header_level(b);
- if (p->need_commit_sem)
+ b = btrfs_clone_extent_buffer(root->commit_root);
up_read(&fs_info->commit_root_sem);
+ if (!b)
+ return ERR_PTR(-ENOMEM);
+
+ } else {
+ b = root->commit_root;
+ extent_buffer_get(b);
+ }
+ level = btrfs_header_level(b);
/*
* Ensure that all callers have set skip_locking when
* p->search_commit_root = 1.
@@ -2720,6 +2733,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
again:
prev_cmp = -1;
b = btrfs_search_slot_get_root(root, p, write_lock_level);
+ if (IS_ERR(b)) {
+ ret = PTR_ERR(b);
+ goto done;
+ }
while (b) {
level = btrfs_header_level(b);
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From be6821f82c3cc36e026f5afd10249988852b35ea Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Tue, 11 Dec 2018 10:19:45 +0000
Subject: [PATCH] Btrfs: send, fix race with transaction commits that create
snapshots
If we create a snapshot of a snapshot currently being used by a send
operation, we can end up with send failing unexpectedly (returning
-ENOENT error to user space for example). The following diagram shows
how this happens.
        CPU 1                             CPU 2                            CPU 3

 btrfs_ioctl_send()
  (...)
                                  create_snapshot()
                                   -> creates snapshot of a
                                      root used by the send
                                      task
                                  btrfs_commit_transaction()
                                   create_pending_snapshot()
  __get_inode_info()
   btrfs_search_slot()
    btrfs_search_slot_get_root()
     down_read commit_root_sem
     get reference on eb of the
     commit root
      -> eb with bytenr == X
     up_read commit_root_sem
                                   btrfs_cow_block(root node)
                                    btrfs_free_tree_block()
                                     -> creates delayed ref to
                                        free the extent
                                   btrfs_run_delayed_refs()
                                    -> runs the delayed ref,
                                       adds extent to
                                       fs_info->pinned_extents
                                   btrfs_finish_extent_commit()
                                    unpin_extent_range()
                                     -> marks extent as free
                                        in the free space cache
                                   transaction commit finishes
                                                                   btrfs_start_transaction()
                                                                    (...)
                                                                    btrfs_cow_block()
                                                                     btrfs_alloc_tree_block()
                                                                      btrfs_reserve_extent()
                                                                       -> allocates extent at
                                                                          bytenr == X
                                                                      btrfs_init_new_buffer(bytenr X)
                                                                       btrfs_find_create_tree_block()
                                                                        alloc_extent_buffer(bytenr X)
                                                                         find_extent_buffer(bytenr X)
                                                                          -> returns existing eb,
                                                                             which the send task got
                                                                    (...)
                                                                    -> modifies content of the
                                                                       eb with bytenr == X
  -> uses an eb that now
     belongs to some other
     tree and no longer matches
     the commit root of the
     snapshot, results will be
     unpredictable
The consequences of this race vary. Searches in the commit root performed
by the send task can fail unexpectedly (unable to find inode items,
returning -ENOENT to user space, for example), or they can succeed against
an inode item with the same number that was added to the tree that reused
the metadata extent, in which case send behaves incorrectly in the worst
case or just fails later for some reason.
Fix this by performing a copy of the commit root's extent buffer when doing
a search in the context of a send operation.
CC: stable(a)vger.kernel.org # 4.4.x: 1fc28d8e2e9: Btrfs: move get root out of btrfs_search_slot to a helper
CC: stable(a)vger.kernel.org # 4.4.x: f9ddfd0592a: Btrfs: remove unused check of skip_locking
CC: stable(a)vger.kernel.org # 4.4.x
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 72745235896f..4252e89df6ae 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2587,14 +2587,27 @@ static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
root_lock = BTRFS_READ_LOCK;
if (p->search_commit_root) {
- /* The commit roots are read only so we always do read locks */
- if (p->need_commit_sem)
+ /*
+ * The commit roots are read only so we always do read locks,
+ * and we always must hold the commit_root_sem when doing
+ * searches on them, the only exception is send where we don't
+ * want to block transaction commits for a long time, so
+ * we need to clone the commit root in order to avoid races
+ * with transaction commits that create a snapshot of one of
+ * the roots used by a send operation.
+ */
+ if (p->need_commit_sem) {
down_read(&fs_info->commit_root_sem);
- b = root->commit_root;
- extent_buffer_get(b);
- level = btrfs_header_level(b);
- if (p->need_commit_sem)
+ b = btrfs_clone_extent_buffer(root->commit_root);
up_read(&fs_info->commit_root_sem);
+ if (!b)
+ return ERR_PTR(-ENOMEM);
+
+ } else {
+ b = root->commit_root;
+ extent_buffer_get(b);
+ }
+ level = btrfs_header_level(b);
/*
* Ensure that all callers have set skip_locking when
* p->search_commit_root = 1.
@@ -2720,6 +2733,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
again:
prev_cmp = -1;
b = btrfs_search_slot_get_root(root, p, write_lock_level);
+ if (IS_ERR(b)) {
+ ret = PTR_ERR(b);
+ goto done;
+ }
while (b) {
level = btrfs_header_level(b);
The patch above (commit be6821f82c3c, "Btrfs: send, fix race with
transaction commits that create snapshots") also fails to apply to the
4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From d1051d6ebf8ef3517a5a3cf82bba8436d190f1c2 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <nborisov(a)suse.com>
Date: Wed, 21 Nov 2018 17:10:52 +0200
Subject: [PATCH] btrfs: Fix error handling in btrfs_cleanup_ordered_extents
Running btrfs/124 in a loop hung up on me sporadically with the
following call trace:
btrfs D 0 5760 5324 0x00000000
Call Trace:
? __schedule+0x243/0x800
schedule+0x33/0x90
btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
? wait_woken+0xa0/0xa0
btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
btrfs_relocate_chunk+0x49/0x100 [btrfs]
btrfs_balance+0xbeb/0x1740 [btrfs]
btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
btrfs_ioctl+0x1691/0x3110 [btrfs]
? lockdep_hardirqs_on+0xed/0x180
? __handle_mm_fault+0x8e7/0xfb0
? _raw_spin_unlock+0x24/0x30
? __handle_mm_fault+0x8e7/0xfb0
? do_vfs_ioctl+0xa5/0x6e0
? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
do_vfs_ioctl+0xa5/0x6e0
? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
ksys_ioctl+0x3a/0x70
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x60/0x1b0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This happens because during page writeback it's valid for
writepage_delalloc to instantiate a delalloc range which doesn't belong
to the page currently being written back.
The reason this case is valid is due to find_lock_delalloc_range
returning any available range after the passed delalloc_start and
ignoring whether the page under writeback is within that range.
In turn ordered extents (OE) are always created for the returned range
from find_lock_delalloc_range. If, however, a failure occurs while OE
are being created then the clean up code in btrfs_cleanup_ordered_extents
will be called.
Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
the case of such a 'foreign' range being processed and instead always
assumes that the range the OE are created for belongs to the page. This
leads to the first page of such a foreign range not being cleaned up,
since it is deliberately missed and skipped by the current cleanup code.
Fix this by correctly checking whether the current page belongs to the
range being instantiated and, if so, adjusting the range parameters passed
for cleanup. If it doesn't, then just clean the whole OE range directly.
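As a rough userspace sketch of the containment check the fix adds (the
PAGE_SIZE value and the sample numbers below are assumptions for the
demo, not values from the patch):

	#include <stdio.h>
	#include <stdint.h>

	#define PAGE_SIZE 4096ULL

	int main(void)
	{
		/* A delalloc range of 4 pages, locked page at its start. */
		uint64_t offset = 8 * PAGE_SIZE, bytes = 4 * PAGE_SIZE;
		uint64_t page_start = 8 * PAGE_SIZE;
		uint64_t page_end = page_start + PAGE_SIZE - 1;

		/*
		 * Skip the first page only when the locked page lies inside
		 * the range being instantiated; the caller of
		 * run_delalloc_range cleans that page up itself.
		 */
		if (page_start >= offset && page_end <= offset + bytes - 1) {
			offset += PAGE_SIZE;
			bytes -= PAGE_SIZE;
		}

		printf("cleanup range: offset=%llu bytes=%llu\n",
		       (unsigned long long)offset, (unsigned long long)bytes);
		return 0;
	}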
Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
CC: stable(a)vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e7b97c699acf..e1451a69432b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -110,17 +110,17 @@ static void __endio_write_update_ordered(struct inode *inode,
* extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
* and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
* to be released, which we want to happen only when finishing the ordered
- * extent (btrfs_finish_ordered_io()). Also note that the caller of
- * btrfs_run_delalloc_range already does proper cleanup for the first page of
- * the range, that is, it invokes the callback writepage_end_io_hook() for the
- * range of the first page.
+ * extent (btrfs_finish_ordered_io()).
*/
static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
- const u64 offset,
- const u64 bytes)
+ struct page *locked_page,
+ u64 offset, u64 bytes)
{
unsigned long index = offset >> PAGE_SHIFT;
unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
+ u64 page_start = page_offset(locked_page);
+ u64 page_end = page_start + PAGE_SIZE - 1;
+
struct page *page;
while (index <= end_index) {
@@ -131,8 +131,18 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
ClearPagePrivate2(page);
put_page(page);
}
- return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
- bytes - PAGE_SIZE, false);
+
+ /*
+ * In case this page belongs to the delalloc range being instantiated
+ * then skip it, since the first page of a range is going to be
+ * properly cleaned up by the caller of run_delalloc_range
+ */
+ if (page_start >= offset && page_end <= (offset + bytes - 1)) {
+ offset += PAGE_SIZE;
+ bytes -= PAGE_SIZE;
+ }
+
+ return __endio_write_update_ordered(inode, offset, bytes, false);
}
static int btrfs_dirty_inode(struct inode *inode);
@@ -1603,7 +1613,8 @@ int btrfs_run_delalloc_range(void *private_data, struct page *locked_page,
write_flags);
}
if (ret)
- btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
+ btrfs_cleanup_ordered_extents(inode, locked_page, start,
+ end - start + 1);
return ret;
}
The patch above (commit d1051d6ebf8e, "btrfs: Fix error handling in
btrfs_cleanup_ordered_extents") also fails to apply to the 4.19-stable
and 4.20-stable trees.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a5fb11429167ee6ddeeacc554efaf5776b36433a Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Mon, 26 Nov 2018 20:07:17 +0000
Subject: [PATCH] Btrfs: fix deadlock with memory reclaim during scrub
When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we cannot do memory allocations with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.
Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe(); later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. The same problem could happen while scrubbing
the super blocks, since that also calls scrub_pages().
We also cannot have any of the worker tasks, created by the scrub task,
doing GFP_KERNEL allocations, because before pausing, the scrub task waits
for all the worker tasks to complete (also done at scrub_stripe()).
So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.
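As a rough sketch of the scoped-NOFS pattern the fix applies (struct foo
and alloc_foo() are invented names for illustration; memalloc_nofs_save()
and memalloc_nofs_restore() are the real helpers from <linux/sched/mm.h>):

	#include <linux/slab.h>
	#include <linux/sched/mm.h>

	struct foo {
		int data;
	};

	static struct foo *alloc_foo(void)
	{
		unsigned int nofs_flag;
		struct foo *foo;

		/*
		 * Everything between save and restore behaves as GFP_NOFS,
		 * even though the allocation itself passes GFP_KERNEL, so
		 * direct reclaim cannot recurse into the filesystem and
		 * deadlock against a transaction waiting for scrub to pause.
		 */
		nofs_flag = memalloc_nofs_save();
		foo = kmalloc(sizeof(*foo), GFP_KERNEL);
		memalloc_nofs_restore(nofs_flag);

		return foo;
	}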
Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
CC: stable(a)vger.kernel.org # 4.6+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 902819d3cf41..bbd1b36f4918 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -322,6 +322,7 @@ static struct full_stripe_lock *insert_full_stripe_lock(
struct rb_node *parent = NULL;
struct full_stripe_lock *entry;
struct full_stripe_lock *ret;
+ unsigned int nofs_flag;
lockdep_assert_held(&locks_root->lock);
@@ -339,8 +340,17 @@ static struct full_stripe_lock *insert_full_stripe_lock(
}
}
- /* Insert new lock */
+ /*
+ * Insert new lock.
+ *
+ * We must use GFP_NOFS because the scrub task might be waiting for a
+ * worker task executing this function and in turn a transaction commit
+ * might be waiting the scrub task to pause (which needs to wait for all
+ * the worker tasks to complete before pausing).
+ */
+ nofs_flag = memalloc_nofs_save();
ret = kmalloc(sizeof(*ret), GFP_KERNEL);
+ memalloc_nofs_restore(nofs_flag);
if (!ret)
return ERR_PTR(-ENOMEM);
ret->logical = fstripe_logical;
@@ -1620,8 +1630,19 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
mutex_lock(&sctx->wr_lock);
again:
if (!sctx->wr_curr_bio) {
+ unsigned int nofs_flag;
+
+ /*
+ * We must use GFP_NOFS because the scrub task might be waiting
+ * for a worker task executing this function and in turn a
+ * transaction commit might be waiting the scrub task to pause
+ * (which needs to wait for all the worker tasks to complete
+ * before pausing).
+ */
+ nofs_flag = memalloc_nofs_save();
sctx->wr_curr_bio = kzalloc(sizeof(*sctx->wr_curr_bio),
GFP_KERNEL);
+ memalloc_nofs_restore(nofs_flag);
if (!sctx->wr_curr_bio) {
mutex_unlock(&sctx->wr_lock);
return -ENOMEM;
@@ -3772,6 +3793,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
struct scrub_ctx *sctx;
int ret;
struct btrfs_device *dev;
+ unsigned int nofs_flag;
if (btrfs_fs_closing(fs_info))
return -EINVAL;
@@ -3875,6 +3897,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
atomic_inc(&fs_info->scrubs_running);
mutex_unlock(&fs_info->scrub_lock);
+ /*
+ * In order to avoid deadlock with reclaim when there is a transaction
+ * trying to pause scrub, make sure we use GFP_NOFS for all the
+ * allocations done at btrfs_scrub_pages() and scrub_pages_for_parity()
+ * invoked by our callees. The pausing request is done when the
+ * transaction commit starts, and it blocks the transaction until scrub
+ * is paused (done at specific points at scrub_stripe() or right above
+ * before incrementing fs_info->scrubs_running).
+ */
+ nofs_flag = memalloc_nofs_save();
if (!is_dev_replace) {
/*
* by holding device list mutex, we can
@@ -3887,6 +3919,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
if (!ret)
ret = scrub_enumerate_chunks(sctx, dev, start, end);
+ memalloc_nofs_restore(nofs_flag);
wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
atomic_dec(&fs_info->scrubs_running);
The patch above (commit a5fb11429167, "Btrfs: fix deadlock with memory
reclaim during scrub") also fails to apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 0d228ece59a35a9b9e8ff0d40653234a6d90f61e Mon Sep 17 00:00:00 2001
From: Anand Jain <anand.jain(a)oracle.com>
Date: Sun, 11 Nov 2018 22:22:17 +0800
Subject: [PATCH] btrfs: dev-replace: go back to suspended state if target
device is missing
At the time of a forced unmount we place the running replace into the
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED state, so when the system comes
back we expect the target device to be missing.
In that case let the replace state remain
BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED instead of
BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED, as there isn't any matching scrub
running as part of the replace.
Fixes: e93c89c1aaaa ("Btrfs: add new sources for device replace code")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain(a)oracle.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 32da6901dc88..11df8f778b63 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -890,6 +890,8 @@ int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info)
"cannot continue dev_replace, tgtdev is missing");
btrfs_info(fs_info,
"you may cancel the operation after 'mount -o degraded'");
+ dev_replace->replace_state =
+ BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED;
btrfs_dev_replace_write_unlock(dev_replace);
return 0;
}
The patch above (commit 0d228ece59a3, "btrfs: dev-replace: go back to
suspended state if target device is missing") also fails to apply to the
4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 65b6657672388b72822e0367f06d41c1e3ffb5bb Mon Sep 17 00:00:00 2001
From: Jernej Skrabec <jernej.skrabec(a)siol.net>
Date: Sun, 4 Nov 2018 19:26:40 +0100
Subject: [PATCH] clk: sunxi-ng: Use u64 for calculation of NM rate
The Allwinner H6 SoC has a multiplier N range between 1 and 254. Since
the parent rate is 24 MHz, the intermediate result when calculating the
final rate easily overflows a 32-bit variable.
Because of that, introduce a function for calculating the clock rate
which uses a 64-bit variable for the intermediate result.
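A quick standalone demonstration of the overflow, using the commit's own
numbers (24 MHz parent, N = 254); the program is illustrative, not driver
code:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t parent = 24000000;	/* 24 MHz parent clock */
		uint64_t n = 254, m = 1;

		/* 24000000 * 254 = 6,096,000,000 does not fit in 32 bits. */
		uint32_t rate32 = (uint32_t)(parent * n) / m;	/* truncated */
		uint64_t rate64 = parent * n / m;		/* correct */

		printf("32-bit: %u Hz, 64-bit: %llu Hz\n",
		       rate32, (unsigned long long)rate64);
		return 0;
	}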
Fixes: 6174a1e24b0d ("clk: sunxi-ng: Add N-M-factor clock support")
Fixes: ee28648cb2b4 ("clk: sunxi-ng: Remove the use of rational computations")
CC: <stable(a)vger.kernel.org>
Signed-off-by: Jernej Skrabec <jernej.skrabec(a)siol.net>
Signed-off-by: Maxime Ripard <maxime.ripard(a)bootlin.com>
diff --git a/drivers/clk/sunxi-ng/ccu_nm.c b/drivers/clk/sunxi-ng/ccu_nm.c
index 6fe3c14f7b2d..424d8635b053 100644
--- a/drivers/clk/sunxi-ng/ccu_nm.c
+++ b/drivers/clk/sunxi-ng/ccu_nm.c
@@ -19,6 +19,17 @@ struct _ccu_nm {
unsigned long m, min_m, max_m;
};
+static unsigned long ccu_nm_calc_rate(unsigned long parent,
+ unsigned long n, unsigned long m)
+{
+ u64 rate = parent;
+
+ rate *= n;
+ do_div(rate, m);
+
+ return rate;
+}
+
static void ccu_nm_find_best(unsigned long parent, unsigned long rate,
struct _ccu_nm *nm)
{
@@ -28,7 +39,8 @@ static void ccu_nm_find_best(unsigned long parent, unsigned long rate,
for (_n = nm->min_n; _n <= nm->max_n; _n++) {
for (_m = nm->min_m; _m <= nm->max_m; _m++) {
- unsigned long tmp_rate = parent * _n / _m;
+ unsigned long tmp_rate = ccu_nm_calc_rate(parent,
+ _n, _m);
if (tmp_rate > rate)
continue;
@@ -100,7 +112,7 @@ static unsigned long ccu_nm_recalc_rate(struct clk_hw *hw,
if (ccu_sdm_helper_is_enabled(&nm->common, &nm->sdm))
rate = ccu_sdm_helper_read_rate(&nm->common, &nm->sdm, m, n);
else
- rate = parent_rate * n / m;
+ rate = ccu_nm_calc_rate(parent_rate, n, m);
if (nm->common.features & CCU_FEATURE_FIXED_POSTDIV)
rate /= nm->fixed_post_div;
@@ -149,7 +161,7 @@ static long ccu_nm_round_rate(struct clk_hw *hw, unsigned long rate,
_nm.max_m = nm->m.max ?: 1 << nm->m.width;
ccu_nm_find_best(*parent_rate, rate, &_nm);
- rate = *parent_rate * _nm.n / _nm.m;
+ rate = ccu_nm_calc_rate(*parent_rate, _nm.n, _nm.m);
if (nm->common.features & CCU_FEATURE_FIXED_POSTDIV)
rate /= nm->fixed_post_div;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 1b3ab5ad1b8ad99bae76ec583809c5f5a31c707c Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson(a)intel.com>
Date: Mon, 3 Dec 2018 13:52:51 -0800
Subject: [PATCH] KVM: nVMX: Free the VMREAD/VMWRITE bitmaps if
alloc_kvm_area() fails
Fixes: 34a1cd60d17f ("kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c379d0bfdcba..3ec47b7a94d6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8037,13 +8037,16 @@ static __init int hardware_setup(void)
kvm_mce_cap_supported |= MCG_LMCE_P;
- return alloc_kvm_area();
+ r = alloc_kvm_area();
+ if (r)
+ goto out;
+ return 0;
out:
for (i = 0; i < VMX_BITMAP_NR; i++)
free_page((unsigned long)vmx_bitmap[i]);
- return r;
+ return r;
}
static __exit void hardware_unsetup(void)
The patch above (commit 1b3ab5ad1b8a, "KVM: nVMX: Free the VMREAD/VMWRITE
bitmaps if alloc_kvm_area() fails") also fails to apply to the 4.4-stable
and 4.9-stable trees.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 234ff0b729ad882d20f7996591a964965647addf Mon Sep 17 00:00:00 2001
From: Paul Mackerras <paulus(a)ozlabs.org>
Date: Fri, 16 Nov 2018 21:28:18 +1100
Subject: [PATCH] KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and
MMU mode switch
Testing has revealed an occasional crash which appears to be caused
by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
The symptom is a NULL pointer dereference in __find_linux_pte() called
from kvm_unmap_radix() with kvm->arch.pgtable == NULL.
Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
kvm->arch.pgtable (via kvmppc_free_radix()) before setting
kvm->arch.radix to NULL, and there is nothing to prevent
kvm_unmap_hva_range_hv() or the other MMU callback functions from
being called concurrently with kvmppc_switch_mmu_to_hpt() or
kvmppc_switch_mmu_to_radix().
This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
around the assignments to kvm->arch.radix, and makes sure that the
partition-scoped radix tree or HPT is only freed after changing
kvm->arch.radix.
This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
that the clearing of each rmap array (one per memslot) doesn't happen
concurrently with use of the array in the kvm_unmap_hva_range_hv()
or the other MMU callbacks.
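A simplified, hedged sketch of the ordering the patch enforces (struct
vm_state and free_radix() are stand-ins for the real KVM structures and
for kvmppc_free_radix(), not actual kernel API):

	#include <linux/spinlock.h>

	struct vm_state {
		spinlock_t mmu_lock;
		int radix;	/* mode flag checked by the MMU callbacks */
		void *pgtable;	/* only dereferenced while radix == 1 */
	};

	void free_radix(struct vm_state *vm);	/* stand-in */

	static void switch_to_hpt(struct vm_state *vm)
	{
		/*
		 * Flip the mode flag under mmu_lock first: a concurrent
		 * MMU callback holding mmu_lock either still sees
		 * radix == 1 with a valid pgtable, or sees radix == 0
		 * and never touches pgtable.
		 */
		spin_lock(&vm->mmu_lock);
		vm->radix = 0;
		spin_unlock(&vm->mmu_lock);

		/* Only free the radix tree after the flag is visible. */
		free_radix(vm);
	}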
Fixes: 18c3640cefc7 ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
Cc: stable(a)vger.kernel.org # v4.15+
Signed-off-by: Paul Mackerras <paulus(a)ozlabs.org>
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index c615617e78ac..a18afda3d0f0 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -743,12 +743,15 @@ void kvmppc_rmap_reset(struct kvm *kvm)
srcu_idx = srcu_read_lock(&kvm->srcu);
slots = kvm_memslots(kvm);
kvm_for_each_memslot(memslot, slots) {
+ /* Mutual exclusion with kvm_unmap_hva_range etc. */
+ spin_lock(&kvm->mmu_lock);
/*
* This assumes it is acceptable to lose reference and
* change bits across a reset.
*/
memset(memslot->arch.rmap, 0,
memslot->npages * sizeof(*memslot->arch.rmap));
+ spin_unlock(&kvm->mmu_lock);
}
srcu_read_unlock(&kvm->srcu, srcu_idx);
}
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a56f8413758a..ab43306c4ea1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4532,12 +4532,15 @@ int kvmppc_switch_mmu_to_hpt(struct kvm *kvm)
{
if (nesting_enabled(kvm))
kvmhv_release_all_nested(kvm);
+ kvmppc_rmap_reset(kvm);
+ kvm->arch.process_table = 0;
+ /* Mutual exclusion with kvm_unmap_hva_range etc. */
+ spin_lock(&kvm->mmu_lock);
+ kvm->arch.radix = 0;
+ spin_unlock(&kvm->mmu_lock);
kvmppc_free_radix(kvm);
kvmppc_update_lpcr(kvm, LPCR_VPM1,
LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE | LPCR_HR);
- kvmppc_rmap_reset(kvm);
- kvm->arch.radix = 0;
- kvm->arch.process_table = 0;
return 0;
}
@@ -4549,12 +4552,14 @@ int kvmppc_switch_mmu_to_radix(struct kvm *kvm)
err = kvmppc_init_vm_radix(kvm);
if (err)
return err;
-
+ kvmppc_rmap_reset(kvm);
+ /* Mutual exclusion with kvm_unmap_hva_range etc. */
+ spin_lock(&kvm->mmu_lock);
+ kvm->arch.radix = 1;
+ spin_unlock(&kvm->mmu_lock);
kvmppc_free_hpt(&kvm->arch.hpt);
kvmppc_update_lpcr(kvm, LPCR_UPRT | LPCR_GTSE | LPCR_HR,
LPCR_VPM1 | LPCR_UPRT | LPCR_GTSE | LPCR_HR);
- kvmppc_rmap_reset(kvm);
- kvm->arch.radix = 1;
return 0;
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 5b5e4d623ec8a34689df98e42d038a3b594d2ff9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko(a)suse.com>
Date: Tue, 13 Nov 2018 19:49:10 +0100
Subject: [PATCH] x86/speculation/l1tf: Drop the swap storage limit restriction
when l1tf=off
Swap storage is restricted to max_swapfile_size (~16TB on x86_64) whenever
the system is deemed affected by L1TF vulnerability. Even though the limit
is quite high for most deployments it seems to be too restrictive for
deployments which are willing to live with the mitigation disabled.
We have a customer deploying 8x 6.4TB PCIe/NVMe SSD swap devices, which is
clearly over the limit.
Drop the swap restriction when l1tf=off is specified. It also doesn't make
much sense to warn about too much memory for the l1tf mitigation when it is
forcefully disabled by the administrator.
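For example (an illustrative bootloader entry, not taken from the patch),
a deployment that accepts the unmitigated risk can append the option to
the kernel command line:

	linux /boot/vmlinuz root=/dev/sda1 ro quiet l1tf=off

after which max_swapfile_size() no longer applies the MAX_PA/2 limit.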
[ tglx: Folded the documentation delta change ]
Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin(a)soleen.com>
Reviewed-by: Andi Kleen <ak(a)linux.intel.com>
Acked-by: Jiri Kosina <jkosina(a)suse.cz>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Andi Kleen <ak(a)linux.intel.com>
Cc: Borislav Petkov <bp(a)suse.de>
Cc: <linux-mm(a)kvack.org>
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/20181113184910.26697-1-mhocko@kernel.org
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05a252e5178d..835e422572eb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2095,6 +2095,9 @@
off
Disables hypervisor mitigations and doesn't
emit any warnings.
+ It also drops the swap size and available
+ RAM limit restriction on both hypervisor and
+ bare metal.
Default is 'flush'.
diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/l1tf.rst
index b85dd80510b0..9af977384168 100644
--- a/Documentation/admin-guide/l1tf.rst
+++ b/Documentation/admin-guide/l1tf.rst
@@ -405,6 +405,9 @@ time with the option "l1tf=". The valid arguments for this option are:
off Disables hypervisor mitigations and doesn't emit any
warnings.
+ It also drops the swap size and available RAM limit restrictions
+ on both hypervisor and bare metal.
+
============ =============================================================
The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
@@ -576,7 +579,8 @@ Default mitigations
The kernel default mitigations for vulnerable processors are:
- PTE inversion to protect against malicious user space. This is done
- unconditionally and cannot be controlled.
+ unconditionally and cannot be controlled. The swap storage is limited
+ to ~16TB.
- L1D conditional flushing on VMENTER when EPT is enabled for
a guest.
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index a68b32cb845a..58689ac64440 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1002,7 +1002,8 @@ static void __init l1tf_select_mitigation(void)
#endif
half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
- if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
+ if (l1tf_mitigation != L1TF_MITIGATION_OFF &&
+ e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n");
pr_info("You may make it effective by booting the kernel with mem=%llu parameter.\n",
half_pa);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ef99f3892e1f..427a955a2cf2 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -931,7 +931,7 @@ unsigned long max_swapfile_size(void)
pages = generic_max_swapfile_size();
- if (boot_cpu_has_bug(X86_BUG_L1TF)) {
+ if (boot_cpu_has_bug(X86_BUG_L1TF) && l1tf_mitigation != L1TF_MITIGATION_OFF) {
/* Limit the swap file size to MAX_PA/2 for L1TF workaround */
unsigned long long l1tf_limit = l1tf_pfn_limit();
/*
The patch above (commit 5b5e4d623ec8, "x86/speculation/l1tf: Drop the
swap storage limit restriction when l1tf=off") also fails to apply to
the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
Hello!
Please apply commit 102cd9096356 ("qmi_wwan: apply SET_DTR quirk to the
SIMCOM shared device ID") to the v4.9 and v4.14 longterm branches.
Note: There is no point in backporting it to earlier releases, as the
affected devices need the "raw-ip" feature introduced in v4.5.
Thanks,
Bjørn
Build failure log for the arm BeagleBoard-X15 board while running 'make modules':
Git branch: linux-4.14.y
Git URL: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Git commit: b8b55f8fd937348f867eb0e4098b6d658959d075
In file included from sound/pci/hda/patch_realtek.c:33:0:
sound/pci/hda/patch_realtek.c:6381:48: error: 'ALC294_FIXUP_ASUS_SPK'
undeclared here (not in a function); did you mean
'ALC256_FIXUP_ASUS_MIC'?
SND_PCI_QUIRK(0x1043, 0x10a1, "ASUS UX391UA", ALC294_FIXUP_ASUS_SPK),
^
include/sound/core.h:415:43: note: in definition of macro 'SND_PCI_QUIRK'
{_SND_PCI_QUIRK_ID(vend, dev), .value = (val)}
^~~
CC [M] drivers/cpufreq/highbank-cpufreq.o
scripts/Makefile.build:326: recipe for target
'sound/pci/hda/patch_realtek.o' failed
make[5]: *** [sound/pci/hda/patch_realtek.o] Error 1
- Naresh
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 94ea56cff506c769a509c5dd87904c7fe3806a81 Mon Sep 17 00:00:00 2001
From: Hans de Goede <hdegoede(a)redhat.com>
Date: Mon, 3 Dec 2018 21:45:14 +0100
Subject: [PATCH] ASoC: intel: cht_bsw_max98090_ti: Add pmc_plt_clk_0 quirk for
Chromebook Gnawty
The Gnawty model Chromebook uses pmc_plt_clk_0 instead of pmc_plt_clk_3
for the mclk, just like the Clapper and Swanky models.
This commit adds a DMI based quirk for this.
This fixes audio no longer working on these devices after
commit 648e921888ad ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL").
That commit fixed us unnecessarily keeping unused clocks on, but in the
case of the Gnawty it broke audio support, since we were not using the
right clock in the cht_bsw_max98090_ti machine driver.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=201787
Cc: stable(a)vger.kernel.org
Fixes: 648e921888ad ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL")
Reported-and-tested-by: Jaime Pérez <19.jaime.91(a)gmail.com>
Signed-off-by: Hans de Goede <hdegoede(a)redhat.com>
Acked-by: Pierre-Louis Bossart <pierre-louis.bossart(a)linux.intel.com>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/sound/soc/intel/boards/cht_bsw_max98090_ti.c b/sound/soc/intel/boards/cht_bsw_max98090_ti.c
index ad0c98383853..08a5152e635a 100644
--- a/sound/soc/intel/boards/cht_bsw_max98090_ti.c
+++ b/sound/soc/intel/boards/cht_bsw_max98090_ti.c
@@ -396,6 +396,13 @@ static const struct dmi_system_id cht_max98090_quirk_table[] = {
},
.driver_data = (void *)QUIRK_PMC_PLT_CLK_0,
},
+ {
+ /* Gnawty model Chromebook (Acer Chromebook CB3-111) */
+ .matches = {
+ DMI_MATCH(DMI_PRODUCT_NAME, "Gnawty"),
+ },
+ .driver_data = (void *)QUIRK_PMC_PLT_CLK_0,
+ },
{
/* Swanky model Chromebook (Toshiba Chromebook 2) */
.matches = {
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 984bfb398a3af6fa9b7e80165e524933b0616686 Mon Sep 17 00:00:00 2001
From: Hans de Goede <hdegoede(a)redhat.com>
Date: Sun, 2 Dec 2018 13:21:22 +0100
Subject: [PATCH] ASoC: intel: cht_bsw_max98090_ti: Add pmc_plt_clk_0 quirk for
Chromebook Clapper
The Clapper model Chromebook uses pmc_plt_clk_0 instead of pmc_plt_clk_3
for the mclk, just like the Swanky model.
This commit adds a DMI based quirk for this.
This fixes audio no longer working on these devices after
commit 648e921888ad ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL").
That commit fixed us unnecessarily keeping unused clocks on, but in the case
of the Clapper it broke audio support, since we were not using the right
clock in the cht_bsw_max98090_ti machine driver.
Cc: stable(a)vger.kernel.org
Fixes: 648e921888ad ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL")
Signed-off-by: Hans de Goede <hdegoede(a)redhat.com>
Acked-by: Pierre-Louis Bossart <pierre-louis.bossart(a)linux.intel.com>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
diff --git a/sound/soc/intel/boards/cht_bsw_max98090_ti.c b/sound/soc/intel/boards/cht_bsw_max98090_ti.c
index 9d9f6e41d81c..ad0c98383853 100644
--- a/sound/soc/intel/boards/cht_bsw_max98090_ti.c
+++ b/sound/soc/intel/boards/cht_bsw_max98090_ti.c
@@ -389,6 +389,13 @@ static struct snd_soc_card snd_soc_card_cht = {
};
static const struct dmi_system_id cht_max98090_quirk_table[] = {
+ {
+ /* Clapper model Chromebook */
+ .matches = {
+ DMI_MATCH(DMI_PRODUCT_NAME, "Clapper"),
+ },
+ .driver_data = (void *)QUIRK_PMC_PLT_CLK_0,
+ },
{
/* Swanky model Chromebook (Toshiba Chromebook 2) */
.matches = {
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b637ef779575a977068025f842ecd480a9671f3f Mon Sep 17 00:00:00 2001
From: Boris Brezillon <boris.brezillon(a)bootlin.com>
Date: Thu, 13 Dec 2018 11:55:26 +0100
Subject: [PATCH] mtd: rawnand: Fix JEDEC detection
nand_jedec_detect() should return 1 when the PARAM page parsing
succeeds, otherwise the core considers JEDEC detection failed and falls
back to ID-based detection.
Fixes: 480139d9229e ("mtd: rawnand: get rid of the JEDEC parameter page in nand_chip")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Boris Brezillon <boris.brezillon(a)bootlin.com>
Acked-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
diff --git a/drivers/mtd/nand/raw/nand_jedec.c b/drivers/mtd/nand/raw/nand_jedec.c
index 5c26492c841d..38b5dc22cb30 100644
--- a/drivers/mtd/nand/raw/nand_jedec.c
+++ b/drivers/mtd/nand/raw/nand_jedec.c
@@ -107,6 +107,8 @@ int nand_jedec_detect(struct nand_chip *chip)
pr_warn("Invalid codeword size\n");
}
+ ret = 1;
+
free_jedec_param_page:
kfree(p);
return ret;
On Fri, Jan 04, 2019 at 09:03:10PM +0000, Sasha Levin wrote:
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: a7d85e06ed80 crypto: cfb - add support for Cipher FeedBack mode.
>
> The bot has tested the following trees: v4.20.0, v4.19.13.
>
> v4.20.0: Failed to apply! Possible dependencies:
> 7da66670775d ("crypto: testmgr - add AES-CFB tests")
>
> v4.19.13: Failed to apply! Possible dependencies:
> 7da66670775d ("crypto: testmgr - add AES-CFB tests")
> dfb89ab3f0a7 ("crypto: tcrypt - add OFB functional tests")
>
>
> How should we proceed with this patch?
>
> --
> Thanks,
> Sasha
The following will need to be applied to 4.19 and 4.20 first. Both had Cc stable:
fa4600734b74 ("crypto: cfb - fix decryption")
7da66670775d ("crypto: testmgr - add AES-CFB tests")
Herbert, why was CFB accepted without any test vectors in the first place?
- Eric
The patch titled
Subject: mm/vmalloc: fix size check for remap_vmalloc_range_partial()
has been added to the -mm tree. Its filename is
mm-vmalloc-fix-size-check-for-remap_vmalloc_range_partial.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-vmalloc-fix-size-check-for-rema…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmalloc-fix-size-check-for-rema…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Roman Penyaev <rpenyaev(a)suse.de>
Subject: mm/vmalloc: fix size check for remap_vmalloc_range_partial()
When VM_NO_GUARD is not set, area->size includes the adjacent guard page,
so for correct size checking get_vm_area_size() should be used instead of
area->size.
This fixes a possible kernel oops when userspace tries to mmap an area
1 page bigger than was allocated by the vmalloc_user() call: the size check
inside remap_vmalloc_range_partial() also accounts for the non-existing
guard page, so the check passes, but vmalloc_to_page() returns NULL (the
guard page does not physically exist).
The following code pattern example should trigger an oops:
static int oops_mmap(struct file *file, struct vm_area_struct *vma)
{
void *mem;
mem = vmalloc_user(4096);
BUG_ON(!mem);
/* Do not care about mem leak */
return remap_vmalloc_range(vma, mem, 0);
}
And userspace simply mmaps size + PAGE_SIZE:
mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0);
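To make the arithmetic concrete, a worked example (editor's sketch;
assumes 4 KiB pages and VM_NO_GUARD not set):

	/*
	 * vmalloc_user(4096) allocates one usable page plus one guard page:
	 *
	 *   area->addr             = kaddr
	 *   area->size             = 8192  (includes the guard page)
	 *   get_vm_area_size(area) = 4096  (usable size only)
	 *
	 * Old check for the 8192-byte mmap above:
	 *   kaddr + 8192 > area->addr + area->size  ->  false, check passes,
	 *   but vmalloc_to_page() on the second page hits the guard page
	 *   and returns NULL, so the kernel oopses.
	 *
	 * New check:
	 *   kaddr + 8192 > area->addr + get_vm_area_size(area)  ->  true,
	 *   so the mmap is rejected with -EINVAL.
	 */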
Possible candidates for oops which do not have any explicit size
checks:
*** drivers/media/usb/stkwebcam/stk-webcam.c:
v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0);
Or the following one:
*** drivers/video/fbdev/core/fbmem.c
static int
fb_mmap(struct file *file, struct vm_area_struct * vma)
...
res = fb->fb_mmap(info, vma);
Where fb_mmap callback calls remap_vmalloc_range() directly without any
explicit checks:
*** drivers/video/fbdev/vfb.c
static int vfb_mmap(struct fb_info *info,
struct vm_area_struct *vma)
{
return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff);
}
Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev(a)suse.de>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Joe Perches <joe(a)perches.com>
Cc: "Luis R. Rodriguez" <mcgrof(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/vmalloc.c~mm-vmalloc-fix-size-check-for-remap_vmalloc_range_partial
+++ a/mm/vmalloc.c
@@ -2248,7 +2248,7 @@ int remap_vmalloc_range_partial(struct v
if (!(area->flags & VM_USERMAP))
return -EINVAL;
- if (kaddr + size > area->addr + area->size)
+ if (kaddr + size > area->addr + get_vm_area_size(area))
return -EINVAL;
do {
_
Patches currently in -mm which might be from rpenyaev(a)suse.de are
epoll-make-sure-all-elements-in-ready-list-are-in-fifo-order.patch
epoll-loosen-irq-safety-in-ep_poll_callback.patch
epoll-unify-awaking-of-wakeup-source-on-ep_poll_callback-path.patch
epoll-use-rwlock-in-order-to-reduce-ep_poll_callback-contention.patch
mm-vmalloc-fix-size-check-for-remap_vmalloc_range_partial.patch
mm-vmalloc-do-not-call-kmemleak_free-on-not-yet-accounted-memory.patch
mm-vmalloc-pass-vm_usermap-flags-directly-to-__vmalloc_node_range.patch
The patch titled
Subject: mm: page_mapped: don't assume compound page is huge or THP
has been added to the -mm tree. Its filename is
mm-page_mapped-dont-assume-compound-page-is-huge-or-thp.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-page_mapped-dont-assume-compoun…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-page_mapped-dont-assume-compoun…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Jan Stancek <jstancek(a)redhat.com>
Subject: mm: page_mapped: don't assume compound page is huge or THP
LTP proc01 testcase has been observed to rarely trigger crashes
on arm64:
page_mapped+0x78/0xb4
stable_page_flags+0x27c/0x338
kpageflags_read+0xfc/0x164
proc_reg_read+0x7c/0xb8
__vfs_read+0x58/0x178
vfs_read+0x90/0x14c
SyS_read+0x60/0xc0
The issue is that page_mapped() assumes that if a compound page is not
huge, then it must be THP. But if this is a 'normal' compound page
(COMPOUND_PAGE_DTOR), then the following loop can keep running (for
HPAGE_PMD_NR iterations) until it tries to read from memory that isn't
mapped and triggers a panic:
for (i = 0; i < hpage_nr_pages(page); i++) {
if (atomic_read(&page[i]._mapcount) >= 0)
return true;
}
I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only
with a custom kernel module [1] which:
- allocates compound page (PAGEC) of order 1
- allocates 2 normal pages (COPY), which are initialized to 0xff
(to satisfy _mapcount >= 0)
- 2 PAGEC page structs are copied to address of first COPY page
- second page of COPY is marked as not present
- call to page_mapped(COPY) now triggers fault on access to 2nd
COPY page at offset 0x30 (_mapcount)
[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_cras…
Fix the loop to iterate for "1 << compound_order" pages.
Kirill said: "IIRC, the sound subsystem can produce custom mapped compound
pages".
Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.154357715…
Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
Signed-off-by: Jan Stancek <jstancek(a)redhat.com>
Debugged-by: Laszlo Ersek <lersek(a)redhat.com>
Suggested-by: "Kirill A. Shutemov" <kirill(a)shutemov.name>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/util.c~mm-page_mapped-dont-assume-compound-page-is-huge-or-thp
+++ a/mm/util.c
@@ -478,7 +478,7 @@ bool page_mapped(struct page *page)
return true;
if (PageHuge(page))
return false;
- for (i = 0; i < hpage_nr_pages(page); i++) {
+ for (i = 0; i < (1 << compound_order(page)); i++) {
if (atomic_read(&page[i]._mapcount) >= 0)
return true;
}
_
Patches currently in -mm which might be from jstancek(a)redhat.com are
mm-page_mapped-dont-assume-compound-page-is-huge-or-thp.patch
Commit 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on
memcg charge fail") fixes a crash caused by a failed memcg charge of
the kernel stack. However, the fix misses the cached_stacks case, which
this patch fixes. So the same crash can happen if the memcg charge of
a cached stack fails.
Fixes: 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail")
Signed-off-by: Shakeel Butt <shakeelb(a)google.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Roman Gushchin <guro(a)fb.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
---
kernel/fork.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/fork.c b/kernel/fork.c
index e4a51124661a..593cd1577dff 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -216,6 +216,7 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
memset(s->addr, 0, THREAD_SIZE);
tsk->stack_vm_area = s;
+ tsk->stack = s->addr;
return s->addr;
}
--
2.20.1.415.g653613c723-goog
From: Oliver Hartkopp <socketcan(a)hartkopp.net>
Muyu Yu provided a POC where the root user with CAP_NET_ADMIN can create a
CAN frame modification rule that sets the data length code to a higher value
than the available CAN frame data size. In combination with a configured
checksum calculation where the result is stored relative to the end of the
data (e.g. cgw_csum_xor_rel), the tail of the skb (e.g. the frag_list
pointer in skb_shared_info) can be rewritten, which finally can cause a
system crash.
Michal Kubecek suggested dropping frames that have a DLC exceeding the
available space after the modification process, and provided a patch that
can handle CAN FD frames too. Within this patch we also limit the length
for the checksum calculations to the maximum Classic CAN data length (8).
CAN frames that are dropped by these additional checks are counted with the
CGW_DELETED counter which indicates misconfigurations in can-gw rules.
This fixes CVE-2019-3701.
Reported-by: Muyu Yu <ieatmuttonchuan(a)gmail.com>
Reported-by: Marcus Meissner <meissner(a)suse.de>
Suggested-by: Michal Kubecek <mkubecek(a)suse.cz>
Tested-by: Muyu Yu <ieatmuttonchuan(a)gmail.com>
Tested-by: Oliver Hartkopp <socketcan(a)hartkopp.net>
Signed-off-by: Oliver Hartkopp <socketcan(a)hartkopp.net>
Cc: linux-stable <stable(a)vger.kernel.org> # >= v3.2
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
---
Hello,
I've removed the else from dlc length check. Keeps the code and the
patch more readable.
Marc
net/can/gw.c | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/net/can/gw.c b/net/can/gw.c
index faa3da88a127..bb85d815f092 100644
--- a/net/can/gw.c
+++ b/net/can/gw.c
@@ -416,13 +416,28 @@ static void can_can_gw_rcv(struct sk_buff *skb, void *data)
while (modidx < MAX_MODFUNCTIONS && gwj->mod.modfunc[modidx])
(*gwj->mod.modfunc[modidx++])(cf, &gwj->mod);
- /* check for checksum updates when the CAN frame has been modified */
+ /* Has the CAN frame been modified? */
if (modidx) {
- if (gwj->mod.csumfunc.crc8)
+ /* get available space for the processed CAN frame type */
+ int max_len = nskb->len - offsetof(struct can_frame, data);
+
+ /* dlc may have changed, make sure it fits to the CAN frame */
+ if (cf->can_dlc > max_len)
+ goto out_delete;
+
+ /* check for checksum updates in classic CAN length only */
+ if (gwj->mod.csumfunc.crc8) {
+ if (cf->can_dlc > 8)
+ goto out_delete;
+
(*gwj->mod.csumfunc.crc8)(cf, &gwj->mod.csum.crc8);
+ }
- if (gwj->mod.csumfunc.xor)
+ if (gwj->mod.csumfunc.xor) {
+ if (cf->can_dlc > 8)
+ goto out_delete;
(*gwj->mod.csumfunc.xor)(cf, &gwj->mod.csum.xor);
+ }
}
/* clear the skb timestamp if not configured the other way */
@@ -434,6 +449,14 @@ static void can_can_gw_rcv(struct sk_buff *skb, void *data)
gwj->dropped_frames++;
else
gwj->handled_frames++;
+
+ return;
+
+ out_delete:
+ /* delete frame due to misconfiguration */
+ gwj->deleted_frames++;
+ kfree_skb(nskb);
+ return;
}
static inline int cgw_register_filter(struct net *net, struct cgw_job *gwj)
--
2.20.1
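As an illustration of the boundary condition being checked, a minimal
userspace sketch (editor's sketch; the struct and names are simplified
stand-ins, not the kernel's):

	#include <stddef.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Simplified classic CAN frame layout. */
	struct frame {
		uint32_t can_id;
		uint8_t  can_dlc;	/* data length code, 0..8 for classic CAN */
		uint8_t  data[8];
	};

	int main(void)
	{
		struct frame cf = { .can_dlc = 8 };
		size_t skb_len = sizeof(cf);
		size_t max_len;

		/* A bitwise modification rule can push can_dlc above 8 ... */
		cf.can_dlc |= 0x0f;	/* now 15 */

		/* ... so a checksum stored relative to the end of the data
		 * would land beyond the frame, in adjacent skb metadata. */
		max_len = skb_len - offsetof(struct frame, data);
		printf("dlc=%u available=%zu -> %s\n",
		       (unsigned int)cf.can_dlc, max_len,
		       cf.can_dlc > max_len ? "drop (out_delete)" : "ok");
		return 0;
	}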
Muyu Yu provided a POC where the root user with CAP_NET_ADMIN can create a
CAN frame modification rule that sets the data length code to a higher value
than the available CAN frame data size. In combination with a configured
checksum calculation where the result is stored relative to the end of the
data (e.g. cgw_csum_xor_rel), the tail of the skb (e.g. the frag_list
pointer in skb_shared_info) can be rewritten, which finally can cause a
system crash.
Michal Kubecek suggested dropping frames that have a DLC exceeding the
available space after the modification process, and provided a patch that
can handle CAN FD frames too. Within this patch we also limit the length
for the checksum calculations to the maximum Classic CAN data length (8).
CAN frames that are dropped by these additional checks are counted with the
CGW_DELETED counter which indicates misconfigurations in can-gw rules.
This fixes CVE-2019-3701.
Reported-by: Muyu Yu <ieatmuttonchuan(a)gmail.com>
Reported-by: Marcus Meissner <meissner(a)suse.de>
Suggested-by: Michal Kubecek <mkubecek(a)suse.cz>
Tested-by: Muyu Yu <ieatmuttonchuan(a)gmail.com>
Tested-by: Oliver Hartkopp <socketcan(a)hartkopp.net>
Signed-off-by: Oliver Hartkopp <socketcan(a)hartkopp.net>
Cc: linux-stable <stable(a)vger.kernel.org> # >= v3.2
---
net/can/gw.c | 36 +++++++++++++++++++++++++++++++-----
1 file changed, 31 insertions(+), 5 deletions(-)
diff --git a/net/can/gw.c b/net/can/gw.c
index faa3da88a127..180c389af5b1 100644
--- a/net/can/gw.c
+++ b/net/can/gw.c
@@ -416,13 +416,31 @@ static void can_can_gw_rcv(struct sk_buff *skb, void *data)
while (modidx < MAX_MODFUNCTIONS && gwj->mod.modfunc[modidx])
(*gwj->mod.modfunc[modidx++])(cf, &gwj->mod);
- /* check for checksum updates when the CAN frame has been modified */
+ /* Has the CAN frame been modified? */
if (modidx) {
- if (gwj->mod.csumfunc.crc8)
- (*gwj->mod.csumfunc.crc8)(cf, &gwj->mod.csum.crc8);
+ /* get available space for the processed CAN frame type */
+ int max_len = nskb->len - offsetof(struct can_frame, data);
- if (gwj->mod.csumfunc.xor)
- (*gwj->mod.csumfunc.xor)(cf, &gwj->mod.csum.xor);
+ /* dlc may have changed, make sure it fits to the CAN frame */
+ if (cf->can_dlc > max_len)
+ goto out_delete;
+
+ /* check for checksum updates in classic CAN length only */
+ if (gwj->mod.csumfunc.crc8) {
+ if (cf->can_dlc > 8)
+ goto out_delete;
+ else
+ (*gwj->mod.csumfunc.crc8)
+ (cf, &gwj->mod.csum.crc8);
+ }
+
+ if (gwj->mod.csumfunc.xor) {
+ if (cf->can_dlc > 8)
+ goto out_delete;
+ else
+ (*gwj->mod.csumfunc.xor)
+ (cf, &gwj->mod.csum.xor);
+ }
}
/* clear the skb timestamp if not configured the other way */
@@ -434,6 +452,14 @@ static void can_can_gw_rcv(struct sk_buff *skb, void *data)
gwj->dropped_frames++;
else
gwj->handled_frames++;
+
+ return;
+
+out_delete:
+ /* delete frame due to misconfiguration */
+ gwj->deleted_frames++;
+ kfree_skb(nskb);
+ return;
}
static inline int cgw_register_filter(struct net *net, struct cgw_job *gwj)
--
2.19.2
The ARM Linux kernel handles the EABI syscall numbers as follows:
0 - NR_SYSCALLS-1 : Invoke syscall via syscall table
NR_SYSCALLS - 0xeffff : -ENOSYS (to be allocated in future)
0xf0000 - 0xf07ff : Private syscall or -ENOSYS if not allocated
> 0xf07ff : SIGILL
Our compat code gets this wrong and ends up sending SIGILL in response
to all syscalls greater than NR_SYSCALLS which have a value greater
than 0x7ff in the bottom 16 bits.
Fix this by defining the end of the ARM private syscall region and
checking the syscall number against that directly. Update the comment
while we're at it.
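A compact sketch of the resulting classification (editor's sketch; the
constants come from the table above and the patch below):

	#define NR_SYSCALLS		399	/* __NR_compat_syscalls */
	#define ARM_COMPAT_BASE		0x0f0000
	#define ARM_COMPAT_END		(ARM_COMPAT_BASE + 0x800)

	enum action { DISPATCH, RET_ENOSYS, RAISE_SIGILL };

	/* Classify an EABI syscall number per the table above. Allocated
	 * private calls (cacheflush, set_tls) are handled by the switch in
	 * compat_arm_syscall() before this fallback applies. */
	static enum action classify(unsigned int no)
	{
		if (no < NR_SYSCALLS)
			return DISPATCH;	/* invoke via syscall table */
		if (no < ARM_COMPAT_END)
			return RET_ENOSYS;	/* reserved or unallocated */
		return RAISE_SIGILL;		/* beyond 0xf07ff */
	}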
Cc: <stable(a)vger.kernel.org>
Cc: Dave Martin <Dave.Martin(a)arm.com>
Reported-by: Pi-Hsun Shih <pihsun(a)chromium.org>
Signed-off-by: Will Deacon <will.deacon(a)arm.com>
---
arch/arm64/include/asm/unistd.h | 5 +++--
arch/arm64/kernel/sys_compat.c | 4 ++--
2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index b13ca091f833..be66a54ee3a1 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -40,8 +40,9 @@
* The following SVCs are ARM private.
*/
#define __ARM_NR_COMPAT_BASE 0x0f0000
-#define __ARM_NR_compat_cacheflush (__ARM_NR_COMPAT_BASE+2)
-#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE+5)
+#define __ARM_NR_compat_cacheflush (__ARM_NR_COMPAT_BASE + 2)
+#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
+#define __ARM_NR_compat_syscall_end (__ARM_NR_COMPAT_BASE + 0x800)
#define __NR_compat_syscalls 399
#endif
diff --git a/arch/arm64/kernel/sys_compat.c b/arch/arm64/kernel/sys_compat.c
index 32653d156747..5972b7533fa0 100644
--- a/arch/arm64/kernel/sys_compat.c
+++ b/arch/arm64/kernel/sys_compat.c
@@ -102,12 +102,12 @@ long compat_arm_syscall(struct pt_regs *regs)
default:
/*
- * Calls 9f00xx..9f07ff are defined to return -ENOSYS
+ * Calls 0xf0xxx..0xf07ff are defined to return -ENOSYS
* if not implemented, rather than raising SIGILL. This
* way the calling program can gracefully determine whether
* a feature is supported.
*/
- if ((no & 0xffff) <= 0x7ff)
+ if (no < __ARM_NR_compat_syscall_end)
return -ENOSYS;
break;
}
--
2.1.4
The syscall number may have been changed by a tracer, so we should pass
the actual number in from the caller instead of pulling it from the
saved r7 value directly.
Cc: <stable(a)vger.kernel.org>
Cc: Dave Martin <Dave.Martin(a)arm.com>
Cc: Pi-Hsun Shih <pihsun(a)chromium.org>
Signed-off-by: Will Deacon <will.deacon(a)arm.com>
---
arch/arm64/kernel/sys_compat.c | 9 ++++-----
arch/arm64/kernel/syscall.c | 9 ++++-----
2 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/kernel/sys_compat.c b/arch/arm64/kernel/sys_compat.c
index 5972b7533fa0..54c29cd38ff9 100644
--- a/arch/arm64/kernel/sys_compat.c
+++ b/arch/arm64/kernel/sys_compat.c
@@ -66,12 +66,11 @@ do_compat_cache_op(unsigned long start, unsigned long end, int flags)
/*
* Handle all unrecognised system calls.
*/
-long compat_arm_syscall(struct pt_regs *regs)
+long compat_arm_syscall(struct pt_regs *regs, int scno)
{
- unsigned int no = regs->regs[7];
void __user *addr;
- switch (no) {
+ switch (scno) {
/*
* Flush a region from virtual address 'r0' to virtual address 'r1'
* _exclusive_. There is no alignment requirement on either address;
@@ -107,7 +106,7 @@ long compat_arm_syscall(struct pt_regs *regs)
* way the calling program can gracefully determine whether
* a feature is supported.
*/
- if (no < __ARM_NR_compat_syscall_end)
+ if (scno < __ARM_NR_compat_syscall_end)
return -ENOSYS;
break;
}
@@ -116,6 +115,6 @@ long compat_arm_syscall(struct pt_regs *regs)
(compat_thumb_mode(regs) ? 2 : 4);
arm64_notify_die("Oops - bad compat syscall(2)", regs,
- SIGILL, ILL_ILLTRP, addr, no);
+ SIGILL, ILL_ILLTRP, addr, scno);
return 0;
}
diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
index 032d22312881..5610ac01c1ec 100644
--- a/arch/arm64/kernel/syscall.c
+++ b/arch/arm64/kernel/syscall.c
@@ -13,16 +13,15 @@
#include <asm/thread_info.h>
#include <asm/unistd.h>
-long compat_arm_syscall(struct pt_regs *regs);
-
+long compat_arm_syscall(struct pt_regs *regs, int scno);
long sys_ni_syscall(void);
-asmlinkage long do_ni_syscall(struct pt_regs *regs)
+static long do_ni_syscall(struct pt_regs *regs, int scno)
{
#ifdef CONFIG_COMPAT
long ret;
if (is_compat_task()) {
- ret = compat_arm_syscall(regs);
+ ret = compat_arm_syscall(regs, scno);
if (ret != -ENOSYS)
return ret;
}
@@ -47,7 +46,7 @@ static void invoke_syscall(struct pt_regs *regs, unsigned int scno,
syscall_fn = syscall_table[array_index_nospec(scno, sc_nr)];
ret = __invoke_syscall(regs, syscall_fn);
} else {
- ret = do_ni_syscall(regs);
+ ret = do_ni_syscall(regs, scno);
}
regs->regs[0] = ret;
--
2.1.4
The CAN frame modification rules allow bitwise logical operations which can
also be applied to the can_dlc field. Ensure the manipulation result stays
within the can_dlc boundaries, so that CAN drivers do not accidentally write
arbitrary content beyond the data registers in the CAN controller's I/O
memory when processing can-gw manipulated outgoing frames. When passing
these frames to user space this issue had no effect on the kernel and leaked
no data, as we always strictly copy sizeof(struct can_frame) bytes.
Reported-by: Muyu Yu <ieatmuttonchuan(a)gmail.com>
Reported-by: Marcus Meissner <meissner(a)suse.de>
Tested-by: Muyu Yu <ieatmuttonchuan(a)gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan(a)hartkopp.net>
Cc: linux-stable <stable(a)vger.kernel.org> # >= v3.2
---
net/can/gw.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/net/can/gw.c b/net/can/gw.c
index faa3da88a127..9000d9b8a133 100644
--- a/net/can/gw.c
+++ b/net/can/gw.c
@@ -418,6 +418,10 @@ static void can_can_gw_rcv(struct sk_buff *skb, void *data)
/* check for checksum updates when the CAN frame has been modified */
if (modidx) {
+ /* ensure DLC boundaries after the different mods */
+ if (cf->can_dlc > 8)
+ cf->can_dlc = 8;
+
if (gwj->mod.csumfunc.crc8)
(*gwj->mod.csumfunc.crc8)(cf, &gwj->mod.csum.crc8);
--
2.19.2
From: Stanley Chu <stanley.chu(a)mediatek.com>
The commit 356fd2663cff ("scsi: Set request queue runtime PM status
back to active on resume") fixed up the inconsistent RPM status between
request queue and device. However, changing the request queue RPM status
shall be done only on a successful resume; otherwise the status may still
be inconsistent, as below:
Request queue: RPM_ACTIVE
Device: RPM_SUSPENDED
This ends up in a soft lockup, because requests can be submitted to
underlying devices while those devices and their required resources
are not resumed.
For example,
after the above inconsistent status happens, an IO request can be submitted
to the UFS device driver, but required resources (like clocks) are not yet
resumed, which leads to a warning with the call stack below:
WARN_ON(hba->clk_gating.state != CLKS_ON);
ufshcd_queuecommand
scsi_dispatch_cmd
scsi_request_fn
__blk_run_queue
cfq_insert_request
__elv_add_request
blk_flush_plug_list
blk_finish_plug
jbd2_journal_commit_transaction
kjournald2
We may then see all pending IO requests hang because there is no response
from the storage host or device, and then a soft lockup happens in the
system. In the end, the system may crash in many ways.
Fixes: 356fd2663cff ("scsi: Set request queue runtime PM status back to active on resume")
Cc: stable(a)vger.kernel.org
Signed-off-by: Stanley Chu <stanley.chu(a)mediatek.com>
---
drivers/scsi/scsi_pm.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index a2b4179bfdf7..7639df91b110 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -80,8 +80,22 @@ static int scsi_dev_type_resume(struct device *dev,
if (err == 0) {
pm_runtime_disable(dev);
- pm_runtime_set_active(dev);
+ err = pm_runtime_set_active(dev);
pm_runtime_enable(dev);
+
+ /*
+ * Forcibly set runtime PM status of request queue to "active"
+ * to make sure we can again get requests from the queue
+ * (see also blk_pm_peek_request()).
+ *
+ * The resume hook will correct runtime PM status of the disk.
+ */
+ if (!err && scsi_is_sdev_device(dev)) {
+ struct scsi_device *sdev = to_scsi_device(dev);
+
+ if (sdev->request_queue->dev)
+ blk_set_runtime_active(sdev->request_queue);
+ }
}
return err;
@@ -140,16 +154,6 @@ static int scsi_bus_resume_common(struct device *dev,
else
fn = NULL;
- /*
- * Forcibly set runtime PM status of request queue to "active" to
- * make sure we can again get requests from the queue (see also
- * blk_pm_peek_request()).
- *
- * The resume hook will correct runtime PM status of the disk.
- */
- if (scsi_is_sdev_device(dev) && pm_runtime_suspended(dev))
- blk_set_runtime_active(to_scsi_device(dev)->request_queue);
-
if (fn) {
async_schedule_domain(fn, dev, &scsi_sd_pm_domain);
--
2.18.0
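The invariant the patch restores can be stated as: the request queue must
not be RPM_ACTIVE while the device behind it is still RPM_SUSPENDED. A
hypothetical debug helper (editor's sketch, not part of the patch; field
names as in the 4.x kernels under CONFIG_PM):

	static void check_rpm_consistency(struct scsi_device *sdev)
	{
		struct request_queue *q = sdev->request_queue;

		/* The state the old code could leave behind whenever
		 * pm_runtime_set_active() failed during resume. */
		WARN(q->rpm_status == RPM_ACTIVE &&
		     sdev->sdev_gendev.power.runtime_status == RPM_SUSPENDED,
		     "queue RPM_ACTIVE while device RPM_SUSPENDED\n");
	}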
From: Eric Biggers <ebiggers(a)google.com>
The memcpy() in crypto_cfb_decrypt_inplace() uses walk->iv as both the
source and destination, which has undefined behavior. It is unneeded
because walk->iv is already used to hold the previous ciphertext block;
thus, walk->iv is already updated to its final value. So, remove it.
Also, note that in-place decryption is the only case where the previous
ciphertext block is not directly available. Therefore, as a related
cleanup I also updated crypto_cfb_encrypt_segment() to directly use the
previous ciphertext block rather than save it into walk->iv. This makes
it consistent with in-place encryption and out-of-place decryption; now
only in-place decryption is different, because it has to be.
Fixes: a7d85e06ed80 ("crypto: cfb - add support for Cipher FeedBack mode")
Cc: <stable(a)vger.kernel.org> # v4.17+
Cc: James Bottomley <James.Bottomley(a)HansenPartnership.com>
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
crypto/cfb.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/crypto/cfb.c b/crypto/cfb.c
index 183e8a9c33128..4abfe32ff8451 100644
--- a/crypto/cfb.c
+++ b/crypto/cfb.c
@@ -77,12 +77,14 @@ static int crypto_cfb_encrypt_segment(struct skcipher_walk *walk,
do {
crypto_cfb_encrypt_one(tfm, iv, dst);
crypto_xor(dst, src, bsize);
- memcpy(iv, dst, bsize);
+ iv = dst;
src += bsize;
dst += bsize;
} while ((nbytes -= bsize) >= bsize);
+ memcpy(walk->iv, iv, bsize);
+
return nbytes;
}
@@ -162,7 +164,7 @@ static int crypto_cfb_decrypt_inplace(struct skcipher_walk *walk,
const unsigned int bsize = crypto_cfb_bsize(tfm);
unsigned int nbytes = walk->nbytes;
u8 *src = walk->src.virt.addr;
- u8 *iv = walk->iv;
+ u8 * const iv = walk->iv;
u8 tmp[MAX_CIPHER_BLOCKSIZE];
do {
@@ -172,8 +174,6 @@ static int crypto_cfb_decrypt_inplace(struct skcipher_walk *walk,
src += bsize;
} while ((nbytes -= bsize) >= bsize);
- memcpy(walk->iv, iv, bsize);
-
return nbytes;
}
--
2.20.1
From: Eric Biggers <ebiggers(a)google.com>
Like some other block cipher mode implementations, the CFB
implementation assumes that while walking through the scatterlist, a
partial block does not occur until the end. But the walk is incorrectly
being done with a blocksize of 1, as 'cra_blocksize' is set to 1 (since
CFB is a stream cipher) but no 'chunksize' is set. This bug causes
incorrect encryption/decryption for some scatterlist layouts.
Fix it by setting the 'chunksize'. Also extend the CFB test vectors to
cover this bug as well as cases where the message length is not a
multiple of the block size.
Fixes: a7d85e06ed80 ("crypto: cfb - add support for Cipher FeedBack mode")
Cc: <stable(a)vger.kernel.org> # v4.17+
Cc: James Bottomley <James.Bottomley(a)HansenPartnership.com>
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
---
crypto/cfb.c | 6 ++++++
crypto/testmgr.h | 25 +++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/crypto/cfb.c b/crypto/cfb.c
index e81e456734985..183e8a9c33128 100644
--- a/crypto/cfb.c
+++ b/crypto/cfb.c
@@ -298,6 +298,12 @@ static int crypto_cfb_create(struct crypto_template *tmpl, struct rtattr **tb)
inst->alg.base.cra_blocksize = 1;
inst->alg.base.cra_alignmask = alg->cra_alignmask;
+ /*
+ * To simplify the implementation, configure the skcipher walk to only
+ * give a partial block at the very end, never earlier.
+ */
+ inst->alg.chunksize = alg->cra_blocksize;
+
inst->alg.ivsize = alg->cra_blocksize;
inst->alg.min_keysize = alg->cra_cipher.cia_min_keysize;
inst->alg.max_keysize = alg->cra_cipher.cia_max_keysize;
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index e8f47d7b92cdd..7f4dae7a57a1c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -12870,6 +12870,31 @@ static const struct cipher_testvec aes_cfb_tv_template[] = {
"\x75\xa3\x85\x74\x1a\xb9\xce\xf8"
"\x20\x31\x62\x3d\x55\xb1\xe4\x71",
.len = 64,
+ .also_non_np = 1,
+ .np = 2,
+ .tap = { 31, 33 },
+ }, { /* > 16 bytes, not a multiple of 16 bytes */
+ .key = "\x2b\x7e\x15\x16\x28\xae\xd2\xa6"
+ "\xab\xf7\x15\x88\x09\xcf\x4f\x3c",
+ .klen = 16,
+ .iv = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .ptext = "\x6b\xc1\xbe\xe2\x2e\x40\x9f\x96"
+ "\xe9\x3d\x7e\x11\x73\x93\x17\x2a"
+ "\xae",
+ .ctext = "\x3b\x3f\xd9\x2e\xb7\x2d\xad\x20"
+ "\x33\x34\x49\xf8\xe8\x3c\xfb\x4a"
+ "\xc8",
+ .len = 17,
+ }, { /* < 16 bytes */
+ .key = "\x2b\x7e\x15\x16\x28\xae\xd2\xa6"
+ "\xab\xf7\x15\x88\x09\xcf\x4f\x3c",
+ .klen = 16,
+ .iv = "\x00\x01\x02\x03\x04\x05\x06\x07"
+ "\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f",
+ .ptext = "\x6b\xc1\xbe\xe2\x2e\x40\x9f",
+ .ctext = "\x3b\x3f\xd9\x2e\xb7\x2d\xad",
+ .len = 7,
},
};
--
2.20.1
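For background, CFB turns a block cipher into a stream cipher by
encrypting the previous ciphertext block (initially the IV) and XORing the
result with the plaintext. A minimal full-block sketch (editor's sketch;
block_encrypt() is a stand-in for the underlying cipher, not a kernel API):

	#include <stdint.h>
	#include <string.h>

	#define BS 16	/* AES block size */

	typedef void (*block_fn)(uint8_t out[16], const uint8_t in[16]);

	/* Each ciphertext block feeds the next block's cipher input, which
	 * is why the skcipher walk must hand out whole blocks (the
	 * 'chunksize') and may only leave a partial block at the very end. */
	static void cfb_encrypt(const uint8_t *pt, uint8_t *ct, size_t nblocks,
				uint8_t iv[16], block_fn block_encrypt)
	{
		uint8_t ks[16];
		size_t i;
		int j;

		for (i = 0; i < nblocks; i++) {
			block_encrypt(ks, iv);		/* E_K(IV or prev CT) */
			for (j = 0; j < BS; j++)
				ct[i * BS + j] = pt[i * BS + j] ^ ks[j];
			memcpy(iv, &ct[i * BS], BS);	/* chain ciphertext */
		}
	}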
From: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
[Why]
If the "max bpc" isn't explicitly set in the atomic state then it
have a value of 0. This has the correct behavior of limiting a panel
to 8bpc in the case where the panel supports 8bpc. In the case of eDP
panels this isn't a true assumption - there are panels that can only
do 6bpc.
Banding occurs for these displays.
[How]
Initialize the max_bpc when the connector resets to 8bpc. Also carry
over the value when the state is duplicated.
Bugzilla: https://bugs.freedesktop.org/108825
Fixes: 307638884f72 ("drm/amd/display: Support amdgpu "max bpc" connector property")
Backport of commit 49f1c44b581b08e3289127ffe58bd208c3166701 upstream.
Fixes a regression caused by the automatic patch selection of these two patches:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/dri…
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/dri…
Bug: https://bugs.freedesktop.org/show_bug.cgi?id=109122
Cc: Sasha Levin <sashal(a)kernel.org>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
Reviewed-by: Alex Deucher <alexander.deucher(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
(cherry picked from commit 49f1c44b581b08e3289127ffe58bd208c3166701)
Cc: <stable(a)vger.kernel.org> # 4.19.x
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 299def84e69c..d792735f1365 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2894,6 +2894,7 @@ void amdgpu_dm_connector_funcs_reset(struct drm_connector *connector)
state->underscan_enable = false;
state->underscan_hborder = 0;
state->underscan_vborder = 0;
+ state->max_bpc = 8;
__drm_atomic_helper_connector_reset(connector, &state->base);
}
@@ -2911,6 +2912,7 @@ amdgpu_dm_connector_atomic_duplicate_state(struct drm_connector *connector)
if (new_state) {
__drm_atomic_helper_connector_duplicate_state(connector,
&new_state->base);
+ new_state->max_bpc = state->max_bpc;
return &new_state->base;
}
--
2.20.1
The patch titled
Subject: mm/vmalloc: fix size check for remap_vmalloc_range_partial()
has been added to the -mm tree. Its filename is
mm-vmalloc-fix-size-check-for-remap_vmalloc_range_partial.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-vmalloc-fix-size-check-for-rema…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmalloc-fix-size-check-for-rema…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Roman Penyaev <rpenyaev(a)suse.de>
Subject: mm/vmalloc: fix size check for remap_vmalloc_range_partial()
area->size can include the adjacent guard page, but get_vm_area_size()
returns the actual size of the area.
This fixes a possible kernel crash when userspace tries to map an area
1 page bigger: the size check passes, but the following vmalloc_to_page()
returns NULL on the last (non-existing) guard page.
Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de
Signed-off-by: Roman Penyaev <rpenyaev(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Joe Perches <joe(a)perches.com>
Cc: "Luis R. Rodriguez" <mcgrof(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmalloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/vmalloc.c~mm-vmalloc-fix-size-check-for-remap_vmalloc_range_partial
+++ a/mm/vmalloc.c
@@ -2248,7 +2248,7 @@ int remap_vmalloc_range_partial(struct v
if (!(area->flags & VM_USERMAP))
return -EINVAL;
- if (kaddr + size > area->addr + area->size)
+ if (kaddr + size > area->addr + get_vm_area_size(area))
return -EINVAL;
do {
_
Patches currently in -mm which might be from rpenyaev(a)suse.de are
epoll-make-sure-all-elements-in-ready-list-are-in-fifo-order.patch
epoll-loosen-irq-safety-in-ep_poll_callback.patch
epoll-unify-awaking-of-wakeup-source-on-ep_poll_callback-path.patch
epoll-use-rwlock-in-order-to-reduce-ep_poll_callback-contention.patch
mm-vmalloc-fix-size-check-for-remap_vmalloc_range_partial.patch
mm-vmalloc-do-not-call-kmemleak_free-on-not-yet-accounted-memory.patch
mm-vmalloc-pass-vm_usermap-flags-directly-to-__vmalloc_node_range.patch
On 1/3/19 5:52 AM, Sasha Levin wrote:
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: e8c24d3a23a4 x86/pkeys: Allocation/free syscalls.
>
> The bot has tested the following trees: v4.20.0, v4.19.13, v4.14.91, v4.9.148,
>
> v4.20.0: Build OK!
> v4.19.13: Build OK!
> v4.14.91: Build OK!
> v4.9.148: Failed to apply! Possible dependencies:
> c10e83f598d0 ("arch, mm: Allow arch_dup_mmap() to fail")
>
>
> How should we proceed with this patch?
The 4.9 version of arch_dup_mmap() does not contain the
ldt_dup_context() line. We just need to add arch_dup_pkeys(), like so:
void arch_dup_mmap(struct mm_struct *oldmm,
struct mm_struct *mm)
{
+ arch_dup_pkeys(oldmm, mm);
paravirt_arch_dup_mmap(oldmm, mm);
}
Should be a pretty simple merge. We can basically ignore the
ldt_dup_context() changes.
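For reference, the arch_dup_pkeys() helper copies the parent's per-mm pkey
state; upstream it looks roughly like this (editor's sketch of the upstream
fix, trimmed):

	static inline void arch_dup_pkeys(struct mm_struct *oldmm,
					  struct mm_struct *mm)
	{
	#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
		if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
			return;

		/* Duplicate the oldmm pkey state in mm: */
		mm->context.pkey_allocation_map = oldmm->context.pkey_allocation_map;
		mm->context.execute_only_pkey   = oldmm->context.execute_only_pkey;
	#endif
	}

Without this copy, a child created by fork() starts from a default
allocation map and can hand out a pkey the parent has already allocated.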
The patch titled
Subject: mm/usercopy.c: no check page span for stack objects
has been added to the -mm tree. Its filename is
usercopy-no-check-page-span-for-stack-objects.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/usercopy-no-check-page-span-for-st…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/usercopy-no-check-page-span-for-st…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Qian Cai <cai(a)lca.pw>
Subject: mm/usercopy.c: no check page span for stack objects
It is easy to trigger this with CONFIG_HARDENED_USERCOPY_PAGESPAN=y,
usercopy: Kernel memory overwrite attempt detected to spans multiple
pages (offset 0, size 23)!
kernel BUG at mm/usercopy.c:102!
For example,
print_worker_info
char name[WQ_NAME_LEN] = { };
char desc[WORKER_DESC_LEN] = { };
probe_kernel_read(name, wq->name, sizeof(name) - 1);
probe_kernel_read(desc, worker->desc, sizeof(desc) - 1);
__copy_from_user_inatomic
check_object_size
check_heap_object
check_page_span
This is because on-stack variables can cross a PAGE_SIZE boundary and
fail this check:
if (likely(((unsigned long)ptr & (unsigned long)PAGE_MASK) ==
((unsigned long)end & (unsigned long)PAGE_MASK)))
ptr = FFFF889007D7EFF8
end = FFFF889007D7F00E
Hence, fix it by checking if it is a stack object first.
Link: http://lkml.kernel.org/r/20181231030254.99441-1-cai@lca.pw
Signed-off-by: Qian Cai <cai(a)lca.pw>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/usercopy.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/mm/usercopy.c~usercopy-no-check-page-span-for-stack-objects
+++ a/mm/usercopy.c
@@ -262,9 +262,6 @@ void __check_object_size(const void *ptr
/* Check for invalid addresses. */
check_bogus_address((const unsigned long)ptr, n, to_user);
- /* Check for bad heap object. */
- check_heap_object(ptr, n, to_user);
-
/* Check for bad stack object. */
switch (check_stack_object(ptr, n)) {
case NOT_STACK:
@@ -282,6 +279,9 @@ void __check_object_size(const void *ptr
usercopy_abort("process stack", NULL, to_user, 0, n);
}
+ /* Check for bad heap object. */
+ check_heap_object(ptr, n, to_user);
+
/* Check for object in kernel to avoid text exposure. */
check_kernel_text_object((const unsigned long)ptr, n, to_user);
}
_
Patches currently in -mm which might be from cai(a)lca.pw are
mm-page_owner-fix-for-deferred-struct-page-init.patch
usercopy-no-check-page-span-for-stack-objects.patch
Hi folks,
I noticed a regression introduced sometime after 4.19.4 in USB power
management. I have a 2015 MacBook Pro. When I try to do a suspend or a
suspend+hibernate, I get the following error messages trying to
suspend usb2 and the suspend fails. This works fine in 4.19.4:
Dec 22 13:50:36 eric-macbookpro kernel: Freezing remaining freezable
tasks ... (elapsed 0.001 seconds) done.
Dec 22 13:50:36 eric-macbookpro kernel: Suspending console(s) (use
no_console_suspend to debug)
Dec 22 13:50:36 eric-macbookpro kernel: dpm_run_callback():
usb_dev_freeze+0x0/0x10 returns -16
Dec 22 13:50:36 eric-macbookpro kernel: PM: Device usb2 failed to
freeze async: error -16
Dec 22 13:50:38 eric-macbookpro systemd[1]:
systemd-hybrid-sleep.service: Main process exited, code=exited,
status=1/FAILURE
Dec 22 13:50:38 eric-macbookpro systemd[1]:
systemd-hybrid-sleep.service: Failed with result 'exit-code'.
Dec 22 13:50:38 eric-macbookpro systemd[1]: Failed to start Hybrid
Suspend+Hibernate.
Dec 22 13:50:38 eric-macbookpro systemd[1]: Dependency failed for
Hybrid Suspend+Hibernate.
Dec 22 13:50:38 eric-macbookpro systemd[1]: hybrid-sleep.target: Job
hybrid-sleep.target/start failed with result 'dependency'.
Dec 22 13:50:38 eric-macbookpro systemd-logind[1573]: Operation
'sleep' finished.
Dec 22 13:50:38 eric-macbookpro systemd[1]: Stopped target Sleep.
The behavior exists in 4.19.8 and 4.19.11, the kernel versions I have
upgraded to with Arch Linux, so the regression was introduced sometime
between 4.19.4 and 4.19.8. Hibernate still works but when I resume
from hibernate, there is a ksoftirqd and kworker thread/process
together taking up 100% of one core. If I turn off auto power control
for usb1 and usb2, the threads stop spinning. i.e.,
echo 'on' > /sys/bus/usb/devices/usb1/power/control
Any suggestions as to where this regression was introduced and what
can be done to fix it?
Thanks,
Eric
From: Stanley Chu <stanley.chu(a)mediatek.com>
The commit 356fd2663cff ("scsi: Set request queue runtime PM status
back to active on resume") fixed up the inconsistent RPM status between
request queue and device. However, changing the request queue RPM status
shall be done only on a successful resume; otherwise the status may still
be inconsistent, as below:
Request queue: RPM_ACTIVE
Device: RPM_SUSPENDED
This ends up in a soft lockup, because requests can be submitted to
underlying devices while those devices and their required resources
are not resumed.
Fixes: 356fd2663cff ("scsi: Set request queue runtime PM status back to active on resume")
Cc: stable(a)vger.kernel.org
Signed-off-by: Stanley Chu <stanley.chu(a)mediatek.com>
---
drivers/scsi/scsi_pm.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index a2b4179bfdf7..7639df91b110 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -80,8 +80,22 @@ static int scsi_dev_type_resume(struct device *dev,
if (err == 0) {
pm_runtime_disable(dev);
- pm_runtime_set_active(dev);
+ err = pm_runtime_set_active(dev);
pm_runtime_enable(dev);
+
+ /*
+ * Forcibly set runtime PM status of request queue to "active"
+ * to make sure we can again get requests from the queue
+ * (see also blk_pm_peek_request()).
+ *
+ * The resume hook will correct runtime PM status of the disk.
+ */
+ if (!err && scsi_is_sdev_device(dev)) {
+ struct scsi_device *sdev = to_scsi_device(dev);
+
+ if (sdev->request_queue->dev)
+ blk_set_runtime_active(sdev->request_queue);
+ }
}
return err;
@@ -140,16 +154,6 @@ static int scsi_bus_resume_common(struct device *dev,
else
fn = NULL;
- /*
- * Forcibly set runtime PM status of request queue to "active" to
- * make sure we can again get requests from the queue (see also
- * blk_pm_peek_request()).
- *
- * The resume hook will correct runtime PM status of the disk.
- */
- if (scsi_is_sdev_device(dev) && pm_runtime_suspended(dev))
- blk_set_runtime_active(to_scsi_device(dev)->request_queue);
-
if (fn) {
async_schedule_domain(fn, dev, &scsi_sd_pm_domain);
--
2.18.0
Commit 2b2ea09e74a5 ("staging:r8188eu: Use lib80211 to decrypt WEP-frames")
causes scheduling while atomic bugs followed by a hard freeze whenever
the driver tries to connect to a WEP-encrypted network. Experimentation
showed that the freezes were eliminated when module lib80211 was
preloaded, which can be forced by calling lib80211_get_crypto_ops()
directly rather than indirectly through try_then_request_module().
With this change, no BUG messages are logged.
Fixes: 2b2ea09e74a5 ("staging:r8188eu: Use lib80211 to decrypt WEP-frames")
Cc: Stable <stable(a)vger.kernel.org> # v4.17+
Cc: Michael Straube <straube.linux(a)gmail.com>
Cc: Ivan Safonov <insafonov(a)gmail.com>
Signed-off-by: Larry Finger <Larry.Finger(a)lwfinger.net>
---
drivers/staging/rtl8188eu/core/rtw_security.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/staging/rtl8188eu/core/rtw_security.c b/drivers/staging/rtl8188eu/core/rtw_security.c
index 052656a22821..bab96c870042 100644
--- a/drivers/staging/rtl8188eu/core/rtw_security.c
+++ b/drivers/staging/rtl8188eu/core/rtw_security.c
@@ -154,7 +154,7 @@ void rtw_wep_encrypt(struct adapter *padapter, u8 *pxmitframe)
pframe = ((struct xmit_frame *)pxmitframe)->buf_addr + hw_hdr_offset;
- crypto_ops = try_then_request_module(lib80211_get_crypto_ops("WEP"), "lib80211_crypt_wep");
+ crypto_ops = lib80211_get_crypto_ops("WEP");
if (!crypto_ops)
return;
@@ -210,7 +210,7 @@ int rtw_wep_decrypt(struct adapter *padapter, u8 *precvframe)
void *crypto_private = NULL;
int status = _SUCCESS;
const int keyindex = prxattrib->key_index;
- struct lib80211_crypto_ops *crypto_ops = try_then_request_module(lib80211_get_crypto_ops("WEP"), "lib80211_crypt_wep");
+ struct lib80211_crypto_ops *crypto_ops = lib80211_get_crypto_ops("WEP");
char iv[4], icv[4];
if (!crypto_ops) {
--
2.16.4
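For context, try_then_request_module() retries the lookup after loading
the module at call time, and module loading can sleep; that is what
triggered the scheduling-while-atomic bugs in the WEP paths. Its
definition in include/linux/kmod.h is roughly:

	/* Try the expression x; if it yields NULL, request the module and
	 * try once more. __request_module() may sleep, which is unsafe in
	 * atomic context. */
	#define try_then_request_module(x, mod...) \
		((x) ?: (__request_module(true, mod), (x)))

Calling lib80211_get_crypto_ops() directly both avoids the sleeping
fallback and makes lib80211 a hard module dependency, so it is loaded
before the driver needs it.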
An iptables rule like the following on a multicore system will result in
accepting more connections than set in the rule.
iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
--connlimit-above 2000 --connlimit-mask 0 -j DROP
In the check_hlist() function, connections that are found in the saved
connections but not in the netfilter conntrack table are deleted, assuming
that those connections do not exist anymore. But on multicore systems there
exists a small time window when a connection has been added to the
xt_connlimit-maintained rb-tree but has not yet made it into the netfilter
conntrack table. This causes concurrent connections to return incorrect
counts and go over the limit set in the iptables rule.
The fix has been partially backported from the above-mentioned upstream
commit: introduce a timestamp and the owning CPU.
Signed-off-by: Alakesh Haloi <alakeshh(a)amazon.com>
Cc: Pablo Neira Ayuso <pablo(a)netfilter.org>
Cc: Jozsef Kadlecsik <kadlec(a)blackhole.kfki.hu>
Cc: Florian Westphal <fw(a)strlen.de>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: stable(a)vger.kernel.org # v4.15 and before
Cc: netdev(a)vger.kernel.org
Cc: Dmitry Andrianov <dmitry.andrianov(a)alertme.com>
Cc: Justin Pettit <jpettit(a)vmware.com>
Cc: Yi-Hung Wei <yihung.wei(a)gmail.com>
---
net/netfilter/xt_connlimit.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index ffa8eec..e7b092b 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -47,6 +47,8 @@ struct xt_connlimit_conn {
struct hlist_node node;
struct nf_conntrack_tuple tuple;
union nf_inet_addr addr;
+ int cpu;
+ u32 jiffies32;
};
struct xt_connlimit_rb {
@@ -126,6 +128,8 @@ static bool add_hlist(struct hlist_head *head,
return false;
conn->tuple = *tuple;
conn->addr = *addr;
+ conn->cpu = raw_smp_processor_id();
+ conn->jiffies32 = (u32)jiffies;
hlist_add_head(&conn->node, head);
return true;
}
@@ -148,8 +152,26 @@ static unsigned int check_hlist(struct net *net,
hlist_for_each_entry_safe(conn, n, head, node) {
found = nf_conntrack_find_get(net, zone, &conn->tuple);
if (found == NULL) {
- hlist_del(&conn->node);
- kmem_cache_free(connlimit_conn_cachep, conn);
+ /* If connection is not found, it may be because
+ * it has not made into conntrack table yet. We
+ * check if it is a recently created connection
+ * on a different core and do not delete it in that
+ * case.
+ */
+
+ unsigned long a, b;
+ int cpu = raw_smp_processor_id();
+ __u32 age;
+
b = conn->jiffies32;
+ a = (u32)jiffies;
+ age = a - b;
+ if (conn->cpu != cpu && age <= 2) {
+ length++;
+ } else {
+ hlist_del(&conn->node);
+ kmem_cache_free(connlimit_conn_cachep, conn);
+ }
continue;
}
@@ -271,6 +293,8 @@ static void tree_nodes_free(struct rb_root *root,
conn->tuple = *tuple;
conn->addr = *addr;
+ conn->cpu = raw_smp_processor_id();
+ conn->jiffies32 = (u32)jiffies;
rbconn->addr = *addr;
INIT_HLIST_HEAD(&rbconn->hhead);
--
1.8.3.1
An iptables rule like the following on a multicore system will result in
accepting more connections than set in the rule.
iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \
--connlimit-above 2000 --connlimit-mask 0 -j DROP
In the check_hlist() function, connections that are found in the saved
connections but not in the netfilter conntrack table are deleted, assuming
that those connections do not exist anymore. But on multicore systems there
exists a small time window when a connection has been added to the
xt_connlimit-maintained rb-tree but has not yet made it into the netfilter
conntrack table. This causes concurrent connections to return incorrect
counts and go over the limit set in the iptables rule.
The interleaving below shows the race; Connection 1 is handled on Core 1
and Connection 2 on Core 2. Initially list_length = N and
conntrack_table_len = N.

Core 1 (Connection 1):
  spin_lock_bh()
  In check_hlist():
    a. loop over saved connections
       1. call nf_conntrack_find_get()
       2. if not found in 1, call hlist_del()
    b. return total count to caller
    c. connection 1 gets added to the list of saved connections
  spin_unlock_bh()
  -> list_length = N + 1

Core 2 (Connection 2):
  spin_lock_bh()
  In check_hlist():
    a. loop over saved connections
       1. call nf_conntrack_find_get()
       2. not found: connection 1 was in the list but not in nf_conntrack
          yet, so hlist_del() deletes connection 1
          -> list_length = N, conntrack_table_len = N
    b. return total count to caller
    c. connection 2 gets added to the list of saved connections
  spin_unlock_bh()

Core 1:
  d. connection 1 gets added to nf_conntrack
     -> list_length = N + 1, conntrack_table_len = N + 1

Core 2:
  e. connection 2 gets added to nf_conntrack
     -> list_length = N + 1, conntrack_table_len = N + 2
So we end up with N + 1 connections in the list but N + 2 in nf_conntrack,
eventually allowing more connections than set in the rule.
This fix adds an additional field to track such pending connections,
prevents them from being deleted by another execution thread on a
different core, and returns the correct count.
Signed-off-by: Alakesh Haloi <alakeshh(a)amazon.com>
Cc: Pablo Neira Ayuso <pablo(a)netfilter.org>
Cc: Jozsef Kadlecsik <kadlec(a)blackhole.kfki.hu>
Cc: Florian Westphal <fw(a)strlen.de>
Cc: "David S. Miller" <davem(a)davemloft.net>
Cc: stable(a)vger.kernel.org # v4.15 and before
---
net/netfilter/xt_connlimit.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index ffa8eec980e9..bd7563c209a4 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -47,6 +47,7 @@ struct xt_connlimit_conn {
struct hlist_node node;
struct nf_conntrack_tuple tuple;
union nf_inet_addr addr;
+ bool pending_add;
};
struct xt_connlimit_rb {
@@ -126,6 +127,7 @@ static bool add_hlist(struct hlist_head *head,
return false;
conn->tuple = *tuple;
conn->addr = *addr;
+ conn->pending_add = true;
hlist_add_head(&conn->node, head);
return true;
}
@@ -144,15 +146,31 @@ static unsigned int check_hlist(struct net *net,
*addit = true;
- /* check the saved connections */
+ /* check the saved connections
+ */
hlist_for_each_entry_safe(conn, n, head, node) {
found = nf_conntrack_find_get(net, zone, &conn->tuple);
if (found == NULL) {
- hlist_del(&conn->node);
- kmem_cache_free(connlimit_conn_cachep, conn);
+ /* It could be an already deleted connection or
+ * a new connection that is not there in conntrack
+ * yet. If former delete it from the list, else
+ * increase count and move on.
+ */
+ if (conn->pending_add) {
+ length++;
+ } else {
+ hlist_del(&conn->node);
+ kmem_cache_free(connlimit_conn_cachep, conn);
+ }
continue;
}
+ /* If it is a connection that was pending insertion to
+ * connection tracking table before, then it's time to clear
+ * the flag.
+ */
+ conn->pending_add = false;
+
found_ct = nf_ct_tuplehash_to_ctrack(found);
if (nf_ct_tuple_equal(&conn->tuple, tuple)) {
--
2.14.4
Hi x86 maintainers,
This is an important fix that I believe needs to be merged for 4.21.
Without it, applications calling fork() can potentially double-allocate
a protection key, causing lots of strange problems.
Thomas's Reviewed-by is on the actual fix, but not the selftest.
I would also be happy to send this as a pull request if you would
prefer.
Cc: x86(a)kernel.org
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Will Deacon <will.deacon(a)arm.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Joerg Roedel <jroedel(a)suse.de>
Cc: stable(a)vger.kernel.org
The patch titled
Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed
has been added to the -mm tree. Its filename is
slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/slab-alien-caches-must-not-be-init…
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/slab-alien-caches-must-not-be-init…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Christoph Lameter <cl(a)linux.com>
Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed
Callers of __alloc_alien() check for NULL. We must do the same check in
__alloc_alien_cache to avoid NULL pointer dereferences on allocation
failures.
Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e9897490…
Signed-off-by: Christoph Lameter <cl(a)linux.com>
Reported-by: syzbot+d6ed4ec679652b4fd4e4(a)syzkaller.appspotmail.com
Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Pekka Enberg <penberg(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/slab.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- a/mm/slab.c~slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed
+++ a/mm/slab.c
@@ -666,8 +666,10 @@ static struct alien_cache *__alloc_alien
struct alien_cache *alc = NULL;
alc = kmalloc_node(memsize, gfp, node);
- init_arraycache(&alc->ac, entries, batch);
- spin_lock_init(&alc->lock);
+ if (alc) {
+ init_arraycache(&alc->ac, entries, batch);
+ spin_lock_init(&alc->lock);
+ }
return alc;
}
_
Patches currently in -mm which might be from cl(a)linux.com are
slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch