From: "Joel Fernandes (Google)" <joel(a)joelfernandes.org>
[ Upstream commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 ]
This patch adds polling support to pidfd.
Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid, which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and when it checks for the existence
of the PID, the wrong process may be checked for existence.
Using the polling support, LMK will be able to get notified when a process
exits in a race-free and fast way, and it can do other things (such as
polling on other fds) while waiting for the process being killed to die.
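As a rough userspace illustration of the usage described above, the sketch
below waits for an fd to become readable with poll(2). The helper name
wait_readable() is hypothetical and not part of this patch; for a pidfd,
POLLIN is reported once the whole thread group has exited.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): wait until fd is readable.
 * Returns 1 if POLLIN was reported, 0 on timeout, -1 on error.
 * A timeout_ms of -1 blocks indefinitely, as with poll(2).
 */
static int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	/* Retry if a signal interrupts the wait (the timeout restarts). */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Because the pidfd is an ordinary file descriptor, the same pollfd array could
also carry other fds, which is what lets a daemon such as LMK keep servicing
them while a kill is pending.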
For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.
We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be an option in this case because the task can
be reaped and destroyed before the poll returns.
2. Including the waitqueue in struct pid means that during
de_thread(), the new thread group leader automatically gets the new
waitqueue/pid even though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Daniel Colascione <dancol(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Tim Murray <timmurray(a)google.com>
Cc: Jonathan Kowalski <bl0pbl33p(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: kernel-team(a)android.com
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Co-developed-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: <stable(a)vger.kernel.org> # 4.19.x
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
include/linux/pid.h | 3 +++
kernel/fork.c | 26 ++++++++++++++++++++++++++
kernel/pid.c | 2 ++
kernel/signal.c | 11 +++++++++++
4 files changed, 42 insertions(+)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 29c0a99..a82d2f7 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@
#define _LINUX_PID_H
#include <linux/rculist.h>
+#include <linux/wait.h>
enum pid_type
{
@@ -60,6 +61,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ /* wait queue for pidfd notifications */
+ wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
};
diff --git a/kernel/fork.c b/kernel/fork.c
index e419891..33dc746 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,8 +1688,34 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
}
#endif
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+ struct task_struct *task;
+ struct pid *pid = file->private_data;
+ int poll_flags = 0;
+
+ poll_wait(file, &pid->wait_pidfd, pts);
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ /*
+ * Inform pollers only when the whole thread group exits.
+ * If the thread group leader exits before all other threads in the
+ * group, then poll(2) should block, similar to the wait(2) family.
+ */
+ if (!task || (task->exit_state && thread_group_empty(task)))
+ poll_flags = POLLIN | POLLRDNORM;
+ rcu_read_unlock();
+
+ return poll_flags;
+}
+
const struct file_operations pidfd_fops = {
.release = pidfd_release,
+ .poll = pidfd_poll,
#ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
#endif
diff --git a/kernel/pid.c b/kernel/pid.c
index b88fe5e..3ba6fcb 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
+ init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->pid_allocated & PIDNS_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index a02a25a..22a04795 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1810,6 +1810,14 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
return ret;
}
+static void do_notify_pidfd(struct task_struct *task)
+{
+ struct pid *pid;
+
+ pid = task_pid(task);
+ wake_up_all(&pid->wait_pidfd);
+}
+
/*
* Let a parent know about the death of a child.
* For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1833,6 +1841,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));
+ /* Wake up all pidfd waiters */
+ do_notify_pidfd(tsk);
+
if (sig != SIGCHLD) {
/*
* This is only possible if parent == real_parent.
--
1.8.3.1
From: Christian Brauner <christian(a)brauner.io>
[ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]
This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call. Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call. As
spotted by Linus, there is exactly one bit for clone() left.
CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api. They serve as a simple opaque handle on pids. Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending). Thus, a
pidfd can refer not just to a tgid, but also to a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege. This does not imply it
cannot ever be that way but for now this is not the case.
A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the caller's
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
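A reader of the fdinfo file could extract the pid along the following lines.
This is a minimal sketch: parse_fdinfo_pid() is a hypothetical helper that
operates on text already read from /proc/<pid>/fdinfo/<fd>, and it assumes
the "Pid:\t%d" line format described above.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: find the line starting with "Pid:\t" in fdinfo
 * text and return the number that follows, or -1 if there is none.
 */
static long parse_fdinfo_pid(const char *text)
{
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, "Pid:\t", 5) == 0) {
			char *end;
			long pid = strtol(line + 5, &end, 10);

			/* No digits after the tab: treat as malformed. */
			if (end == line + 5)
				return -1;
			return pid;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Matching only at the start of a line keeps other fdinfo fields from being
mistaken for the Pid entry.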
As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone. This has the advantage that we can
give back the associated pid and the pidfd at the same time.
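From userspace, the calling convention could look roughly like the sketch
below. It uses the raw clone syscall with a NULL stack so that it behaves
like fork(); the argument order assumed here matches x86-64 and arm64 and may
differ on other architectures. spawn_with_pidfd() and child_exit_7() are
hypothetical names, and the fallback CLONE_PIDFD value is the one this patch
adds to include/uapi/linux/sched.h.

```c
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000	/* value added by this patch */
#endif

/* Hypothetical sample child body: its return value is the exit status. */
static int child_exit_7(void)
{
	return 7;
}

/*
 * Hypothetical helper: create a child and receive its pidfd through the
 * parent_tidptr slot. Returns the child's pid in the parent, -1 on error.
 */
static pid_t spawn_with_pidfd(int (*child_fn)(void), int *pidfd_out)
{
	long ret;

	/* This version of the patch rejects a non-zero value with EINVAL. */
	*pidfd_out = 0;

	ret = syscall(SYS_clone, (unsigned long)(CLONE_PIDFD | SIGCHLD),
		      0UL, pidfd_out, 0UL, 0UL);
	if (ret == 0)
		_exit(child_fn());	/* child: never returns to the caller */

	return (pid_t)ret;
}
```

On success the parent gets the pid as the return value and the pidfd in
*pidfd_out from the same call. A kernel without CLONE_PIDFD support may leave
*pidfd_out at 0, so callers should not assume the value is always a new
descriptor.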
To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.
Suggested-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Co-developed-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: David Howells <dhowells(a)redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages(a)gmail.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: <stable(a)vger.kernel.org> # 4.19.x
(clone: fix up cherry-pick conflicts for b3e583825266)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
include/linux/pid.h | 2 +
include/uapi/linux/sched.h | 1 +
kernel/fork.c | 107 +++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 106 insertions(+), 4 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 14a9a39..29c0a99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
extern struct pid init_struct_pid;
+extern const struct file_operations pidfd_fops;
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f8..ed4ee17 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD 0x00001000 /* set if a pidfd should be placed in parent */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index f2c92c1..e419891 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
* management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
*/
+#include <linux/anon_inodes.h>
#include <linux/slab.h>
#include <linux/sched/autogroup.h>
#include <linux/sched/mm.h>
@@ -21,6 +22,7 @@
#include <linux/sched/task.h>
#include <linux/sched/task_stack.h>
#include <linux/sched/cputime.h>
+#include <linux/seq_file.h>
#include <linux/rtmutex.h>
#include <linux/init.h>
#include <linux/unistd.h>
@@ -1666,6 +1668,58 @@ static void copy_oom_score_adj(u64 clone_flags, struct task_struct *tsk)
mutex_unlock(&oom_adj_mutex);
}
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+
+ file->private_data = NULL;
+ put_pid(pid);
+ return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+ struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+ struct pid *pid = f->private_data;
+
+ seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+ seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+ .release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid: struct pid that the pidfd will reference
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * been unshared to avoid leaking the pidfd to the new process.
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid)
+{
+ int fd;
+
+ fd = anon_inode_getfd("[pidfd]", &pidfd_fops, get_pid(pid),
+ O_RDWR | O_CLOEXEC);
+ if (fd < 0)
+ put_pid(pid);
+
+ return fd;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -1678,13 +1732,14 @@ static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
+ int __user *parent_tidptr,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
- int retval;
+ int pidfd = -1, retval;
struct task_struct *p;
struct multiprocess_signals delayed;
@@ -1734,6 +1789,31 @@ static __latent_entropy struct task_struct *copy_process(
return ERR_PTR(-EINVAL);
}
+ if (clone_flags & CLONE_PIDFD) {
+ int reserved;
+
+ /*
+ * - CLONE_PARENT_SETTID is useless for pidfds and also
+ * parent_tidptr is used to return pidfds.
+ * - CLONE_DETACHED is blocked so that we can potentially
+ * reuse it later for CLONE_PIDFD.
+ * - CLONE_THREAD is blocked until someone really needs it.
+ */
+ if (clone_flags &
+ (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * Verify that parent_tidptr is sane so we can potentially
+ * reuse it later.
+ */
+ if (get_user(reserved, parent_tidptr))
+ return ERR_PTR(-EFAULT);
+
+ if (reserved != 0)
+ return ERR_PTR(-EINVAL);
+ }
+
/*
* Force any signals received before this point to be delivered
* before the fork happens. Collect up signals sent to multiple
@@ -1934,6 +2014,22 @@ static __latent_entropy struct task_struct *copy_process(
}
}
+ /*
+ * This has to happen after we've potentially unshared the file
+ * descriptor table (so that the pidfd doesn't leak into the child
+ * if the fd table isn't shared).
+ */
+ if (clone_flags & CLONE_PIDFD) {
+ retval = pidfd_create(pid);
+ if (retval < 0)
+ goto bad_fork_free_pid;
+
+ pidfd = retval;
+ retval = put_user(pidfd, parent_tidptr);
+ if (retval)
+ goto bad_fork_put_pidfd;
+ }
+
#ifdef CONFIG_BLOCK
p->plug = NULL;
#endif
@@ -1989,7 +2085,7 @@ static __latent_entropy struct task_struct *copy_process(
*/
retval = cgroup_can_fork(p);
if (retval)
- goto bad_fork_free_pid;
+ goto bad_fork_put_pidfd;
/*
* From this point on we must avoid any synchronous user-space
@@ -2111,6 +2207,9 @@ static __latent_entropy struct task_struct *copy_process(
spin_unlock(&current->sighand->siglock);
write_unlock_irq(&tasklist_lock);
cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+ if (clone_flags & CLONE_PIDFD)
+ ksys_close(pidfd);
bad_fork_free_pid:
cgroup_threadgroup_change_end(current);
if (pid != &init_struct_pid)
@@ -2178,7 +2277,7 @@ static inline void init_idle_pids(struct task_struct *idle)
struct task_struct *fork_idle(int cpu)
{
struct task_struct *task;
- task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
+ task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
cpu_to_node(cpu));
if (!IS_ERR(task)) {
init_idle_pids(task);
@@ -2225,7 +2324,7 @@ long _do_fork(unsigned long clone_flags,
trace = 0;
}
- p = copy_process(clone_flags, stack_start, stack_size,
+ p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
add_latent_entropy();
--
1.8.3.1
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]
Rework the flushing of proc to use a list of directory inodes that
need to be flushed.
The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.
This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially allow the first mount of proc to honor its mount options.
This flushing remains an optimization. The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died. When unused the shrinker will
eventually be able to remove these dentries.
There is a case in de_thread where proc_flush_pid can be
called early for a given pid, which winds up being
safe (if suboptimal) as this is just an optimization.
Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.
So that the pid can be used during flushing, its reference count is
taken in release_task and dropped in proc_flush_pid. Further, the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing is taking place. This removes a small race
where a dentry could be recreated.
As struct pid is supposed to be small and I need a per pid lock,
I reuse the only lock that currently exists in struct pid:
the wait_pidfd.lock.
The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.
v2: Initialize pid->inodes. I somehow failed to get that
initialization into the initial version of the patch. A boot
failure was reported by "kernel test robot <lkp(a)intel.com>", and
failure to initialize that pid->inodes matches all of the reported
symptoms.
Signed-off-by: Eric W. Biederman <ebiederm(a)xmission.com>
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid
namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush
entries from multiple proc trees")
Cc: <stable(a)vger.kernel.org> # 4.9.x: b3e583825266: clone: add CLONE_PIDFD
Cc: <stable(a)vger.kernel.org> # 4.9.x: b53b0b9d9a61: pidfd: add polling
support
Cc: <stable(a)vger.kernel.org> # 4.9.x: db978da8fa1d: proc: Pass file mode to
proc_pid_make_inode
Cc: <stable(a)vger.kernel.org> # 4.9.x: 68eb94f16227: proc: Better ownership of
files for non-dumpable tasks in user namespaces
Cc: <stable(a)vger.kernel.org> # 4.9.x: e3912ac37e07: proc: use %u for pid
printing and slightly less stack
Cc: <stable(a)vger.kernel.org> # 4.9.x: 0afa5ca82212: proc: Rename in
proc_inode rename sysctl_inodes sibling_inodes
Cc: <stable(a)vger.kernel.org> # 4.9.x: 26dbc60f385f: proc: Generalize
proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc: <stable(a)vger.kernel.org> # 4.9.x: 71448011ea2a: proc: Clear the pieces of
proc_inode that proc_evict_inode cares about
Cc: <stable(a)vger.kernel.org> # 4.9.x: f90f3cafe8d5: Use d_invalidate in
proc_prune_siblings_dcache
Cc: <stable(a)vger.kernel.org> # 4.9.x
(proc: fix up cherry-pick conflicts for 7bc3e6e55acf)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
fs/proc/base.c | 111 ++++++++++++++++--------------------------------
fs/proc/inode.c | 2 +-
fs/proc/internal.h | 1 +
include/linux/pid.h | 1 +
include/linux/proc_fs.h | 4 +-
kernel/exit.c | 5 ++-
kernel/pid.c | 1 +
7 files changed, 45 insertions(+), 80 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3502a40..11caf35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1728,11 +1728,25 @@ void task_dump_owner(struct task_struct *task, mode_t mode,
*rgid = gid;
}
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+ struct pid *pid = ei->pid;
+
+ if (S_ISDIR(ei->vfs_inode.i_mode)) {
+ spin_lock(&pid->wait_pidfd.lock);
+ hlist_del_init_rcu(&ei->sibling_inodes);
+ spin_unlock(&pid->wait_pidfd.lock);
+ }
+
+ put_pid(pid);
+}
+
struct inode *proc_pid_make_inode(struct super_block * sb,
struct task_struct *task, umode_t mode)
{
struct inode * inode;
struct proc_inode *ei;
+ struct pid *pid;
/* We need a new inode */
@@ -1750,10 +1764,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
/*
* grab the reference to task.
*/
- ei->pid = get_task_pid(task, PIDTYPE_PID);
- if (!ei->pid)
+ pid = get_task_pid(task, PIDTYPE_PID);
+ if (!pid)
goto out_unlock;
+ /* Let the pid remember us for quick removal */
+ ei->pid = pid;
+ if (S_ISDIR(mode)) {
+ spin_lock(&pid->wait_pidfd.lock);
+ hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
+ spin_unlock(&pid->wait_pidfd.lock);
+ }
+
task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
security_task_to_inode(task, inode);
@@ -3015,90 +3037,29 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
.permission = proc_pid_permission,
};
-static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
-{
- struct dentry *dentry, *leader, *dir;
- char buf[10 + 1];
- struct qstr name;
-
- name.name = buf;
- name.len = snprintf(buf, sizeof(buf), "%u", pid);
- /* no ->d_hash() rejects on procfs */
- dentry = d_hash_and_lookup(mnt->mnt_root, &name);
- if (dentry) {
- d_invalidate(dentry);
- dput(dentry);
- }
-
- if (pid == tgid)
- return;
-
- name.name = buf;
- name.len = snprintf(buf, sizeof(buf), "%u", tgid);
- leader = d_hash_and_lookup(mnt->mnt_root, &name);
- if (!leader)
- goto out;
-
- name.name = "task";
- name.len = strlen(name.name);
- dir = d_hash_and_lookup(leader, &name);
- if (!dir)
- goto out_put_leader;
-
- name.name = buf;
- name.len = snprintf(buf, sizeof(buf), "%u", pid);
- dentry = d_hash_and_lookup(dir, &name);
- if (dentry) {
- d_invalidate(dentry);
- dput(dentry);
- }
-
- dput(dir);
-out_put_leader:
- dput(leader);
-out:
- return;
-}
-
/**
- * proc_flush_task - Remove dcache entries for @task from the /proc dcache.
- * @task: task that should be flushed.
+ * proc_flush_pid - Remove dcache entries for @pid from the /proc dcache.
+ * @pid: pid that should be flushed.
*
- * When flushing dentries from proc, one needs to flush them from global
- * proc (proc_mnt) and from all the namespaces' procs this task was seen
- * in. This call is supposed to do all of this job.
- *
- * Looks in the dcache for
- * /proc/@pid
- * /proc/@tgid/task/@pid
- * if either directory is present flushes it and all of it'ts children
- * from the dcache.
+ * This function walks a list of inodes (that belong to any proc
+ * filesystem) that are attached to the pid and flushes them from
+ * the dentry cache.
*
* It is safe and reasonable to cache /proc entries for a task until
* that task exits. After that they just clog up the dcache with
* useless entries, possibly causing useful dcache entries to be
- * flushed instead. This routine is proved to flush those useless
- * dcache entries at process exit time.
+ * flushed instead. This routine is provided to flush those useless
+ * dcache entries when a process is reaped.
*
* NOTE: This routine is just an optimization so it does not guarantee
- * that no dcache entries will exist at process exit time it
- * just makes it very unlikely that any will persist.
+ * that no dcache entries will exist after a process is reaped
+ * it just makes it very unlikely that any will persist.
*/
-void proc_flush_task(struct task_struct *task)
+void proc_flush_pid(struct pid *pid)
{
- int i;
- struct pid *pid, *tgid;
- struct upid *upid;
-
- pid = task_pid(task);
- tgid = task_tgid(task);
-
- for (i = 0; i <= pid->level; i++) {
- upid = &pid->numbers[i];
- proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
- tgid->numbers[i].nr);
- }
+ proc_invalidate_siblings_dcache(&pid->inodes, &pid->wait_pidfd.lock);
+ put_pid(pid);
}
static int proc_pid_instantiate(struct inode *dir,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2af9f4f..8503444 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -39,7 +39,7 @@ static void proc_evict_inode(struct inode *inode)
/* Stop tracking associated processes */
if (ei->pid) {
- put_pid(ei->pid);
+ proc_pid_evict_inode(ei);
ei->pid = NULL;
}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 6a1d679..0c6ca639 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -151,6 +151,7 @@ extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
extern const struct dentry_operations pid_dentry_operations;
extern int pid_getattr(struct vfsmount *, struct dentry *, struct kstat *);
extern int proc_setattr(struct dentry *, struct iattr *);
+extern void proc_pid_evict_inode(struct proc_inode *);
extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
extern int pid_revalidate(struct dentry *, unsigned int);
extern int pid_delete_dentry(const struct dentry *);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index f5552ba..04b4aaa 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -63,6 +63,7 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ struct hlist_head inodes;
/* wait queue for pidfd notifications */
wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b97bf2e..d3580f5 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -12,7 +12,7 @@
#ifdef CONFIG_PROC_FS
extern void proc_root_init(void);
-extern void proc_flush_task(struct task_struct *);
+extern void proc_flush_pid(struct pid *);
extern struct proc_dir_entry *proc_symlink(const char *,
struct proc_dir_entry *, const char *);
@@ -48,7 +48,7 @@ static inline void proc_root_init(void)
{
}
-static inline void proc_flush_task(struct task_struct *task)
+static inline void proc_flush_pid(struct pid *pid)
{
}
diff --git a/kernel/exit.c b/kernel/exit.c
index f9943ef..5e66030 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -168,6 +168,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
void release_task(struct task_struct *p)
{
struct task_struct *leader;
+ struct pid *thread_pid;
int zap_leader;
repeat:
/* don't need to get the RCU readlock here - the process is dead and
@@ -176,10 +177,9 @@ void release_task(struct task_struct *p)
atomic_dec(&__task_cred(p)->user->processes);
rcu_read_unlock();
- proc_flush_task(p);
-
write_lock_irq(&tasklist_lock);
ptrace_release_task(p);
+ thread_pid = get_pid(p->pids[PIDTYPE_PID].pid);
__exit_signal(p);
/*
@@ -202,6 +202,7 @@ void release_task(struct task_struct *p)
}
write_unlock_irq(&tasklist_lock);
+ proc_flush_pid(thread_pid);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);
diff --git a/kernel/pid.c b/kernel/pid.c
index e605398..fb32a81 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -334,6 +334,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
INIT_HLIST_HEAD(&pid->tasks[type]);
init_waitqueue_head(&pid->wait_pidfd);
+ INIT_HLIST_HEAD(&pid->inodes);
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
--
1.8.3.1
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit f90f3cafe8d56d593fc509a4185da1d5800efea4 ]
The function d_prune_aliases has the problem that it will only prune
aliases that are completely unused. It will not remove aliases for
the dcache or even think of removing mounts from the dcache. For that
behavior d_invalidate is needed.
To use d_invalidate replace d_prune_aliases with d_find_alias followed
by d_invalidate and dput.
For completeness the directory and the non-directory cases are
separated because in theory (although not currently in practice for
proc) directories can only ever have a single dentry while
non-directories can have hardlinks and thus multiple dentries.
As part of this separation use d_find_any_alias for directories
to spare d_find_alias the extra work of doing that.
Plus the differences between d_find_any_alias and d_find_alias make
it clear why the directory and non-directory code do not share code.
To make it clear these routines now invalidate dentries, rename
proc_prune_siblings_dcache to proc_invalidate_siblings_dcache, and rename
proc_sys_prune_dcache to proc_sys_invalidate_dcache.
V2: Split the directory and non-directory cases. To make this
code robust to future changes in proc.
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: <stable(a)vger.kernel.org> # 4.9.x
(proc: fix up cherry-pick conflicts for f90f3cafe8d5)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
fs/proc/inode.c | 16 ++++++++++++++--
fs/proc/internal.h | 2 +-
fs/proc/proc_sysctl.c | 8 ++++----
3 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 739fb9c..2af9f4f 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -107,7 +107,7 @@ void __init proc_init_inodecache(void)
init_once);
}
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
{
struct inode *inode;
struct proc_inode *ei;
@@ -136,7 +136,19 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
continue;
}
- d_prune_aliases(inode);
+ if (S_ISDIR(inode->i_mode)) {
+ struct dentry *dir = d_find_any_alias(inode);
+ if (dir) {
+ d_invalidate(dir);
+ dput(dir);
+ }
+ } else {
+ struct dentry *dentry;
+ while ((dentry = d_find_alias(inode))) {
+ d_invalidate(dentry);
+ dput(dentry);
+ }
+ }
iput(inode);
deactivate_super(sb);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9bc44a1..6a1d679 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -200,7 +200,7 @@ struct pde_opener {
extern const struct inode_operations proc_pid_link_inode_operations;
extern void proc_init_inodecache(void);
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
extern int proc_fill_super(struct super_block *, void *data, int flags);
extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index f19063b..b6668a5 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -260,9 +260,9 @@ static void unuse_table(struct ctl_table_header *p)
complete(p->unregistering);
}
-static void proc_sys_prune_dcache(struct ctl_table_header *head)
+static void proc_sys_invalidate_dcache(struct ctl_table_header *head)
{
- proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
+ proc_invalidate_siblings_dcache(&head->inodes, &sysctl_lock);
}
/* called under sysctl_lock, will reacquire if has to wait */
@@ -284,10 +284,10 @@ static void start_unregistering(struct ctl_table_header *p)
spin_unlock(&sysctl_lock);
}
/*
- * Prune dentries for unregistered sysctls: namespaced sysctls
+ * Invalidate dentries for unregistered sysctls: namespaced sysctls
* can have duplicate names and contaminate dcache very badly.
*/
- proc_sys_prune_dcache(p);
+ proc_sys_invalidate_dcache(p);
/*
* do not remove from the list until nobody holds it; walking the
* list in do_sysctl() relies on that.
--
1.8.3.1
From: "Eric W. Biederman" <ebiederm(a)xmission.com>
[ Upstream commit 71448011ea2a1cd36d8f5cbdab0ed716c454d565 ]
This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
inodes when a process dies that caused me to tidy this up.
The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.
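The tidy-up pattern, dropping a reference and then clearing the pointer so a
reused object never sees a stale value, can be sketched in userspace as
follows. The types and helpers here (struct ref_model, struct holder_model,
put_ref(), holder_evict()) are hypothetical models, not kernel APIs.

```c
#include <stddef.h>

/* Hypothetical reference-counted object standing in for struct pid. */
struct ref_model {
	int count;
};

/* Hypothetical holder standing in for struct proc_inode. */
struct holder_model {
	struct ref_model *pid;
};

/* Drop one reference; a NULL pointer is ignored. */
static void put_ref(struct ref_model *r)
{
	if (r)
		r->count--;
}

/*
 * Release the held reference and clear the pointer, so a second call
 * (or a reuse of the holder without reinitialization) is harmless.
 */
static void holder_evict(struct holder_model *h)
{
	if (h->pid) {
		put_ref(h->pid);
		h->pid = NULL;
	}
}
```

With the pointer cleared, evicting the same holder twice drops the reference
only once.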
Signed-off-by: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: <stable(a)vger.kernel.org> # 4.9.x
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
fs/proc/inode.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 920c761..739fb9c 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -32,21 +32,27 @@ static void proc_evict_inode(struct inode *inode)
{
struct proc_dir_entry *de;
struct ctl_table_header *head;
+ struct proc_inode *ei = PROC_I(inode);
truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
/* Stop tracking associated processes */
- put_pid(PROC_I(inode)->pid);
+ if (ei->pid) {
+ put_pid(ei->pid);
+ ei->pid = NULL;
+ }
/* Let go of any associated proc directory entry */
- de = PDE(inode);
- if (de)
+ de = ei->pde;
+ if (de) {
pde_put(de);
+ ei->pde = NULL;
+ }
- head = PROC_I(inode)->sysctl;
+ head = ei->sysctl;
if (head) {
- RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
+ RCU_INIT_POINTER(ei->sysctl, NULL);
proc_sys_evict_inode(inode, head);
}
}
--
1.8.3.1
From: "Joel Fernandes (Google)" <joel(a)joelfernandes.org>
[ Upstream commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 ]
This patch adds polling support to pidfd.
Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid, which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and when it checks for the existence
of the PID, the wrong process may be checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.
For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.
We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
it in task_struct would not be an option in this case because the task can
be reaped and destroyed before the poll returns.
2. Including the waitqueue in struct pid means that during
de_thread(), the new thread group leader automatically gets the new
waitqueue/pid even though its task_struct is different.
Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.
Cc: Andy Lutomirski <luto(a)amacapital.net>
Cc: Steven Rostedt <rostedt(a)goodmis.org>
Cc: Daniel Colascione <dancol(a)google.com>
Cc: Jann Horn <jannh(a)google.com>
Cc: Tim Murray <timmurray(a)google.com>
Cc: Jonathan Kowalski <bl0pbl33p(a)gmail.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Oleg Nesterov <oleg(a)redhat.com>
Cc: kernel-team(a)android.com
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Co-developed-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Daniel Colascione <dancol(a)google.com>
Signed-off-by: Joel Fernandes (Google) <joel(a)joelfernandes.org>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Cc: <stable(a)vger.kernel.org> # 4.9.x
(pidfd: fix up cherry-pick conflicts for b53b0b9d9a61)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
include/linux/pid.h | 3 +++
kernel/fork.c | 26 ++++++++++++++++++++++++++
kernel/pid.c | 2 ++
kernel/signal.c | 11 +++++++++++
4 files changed, 42 insertions(+)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7599a78..f5552ba 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -2,6 +2,7 @@
#define _LINUX_PID_H
#include <linux/rcupdate.h>
+#include <linux/wait.h>
enum pid_type
{
@@ -62,6 +63,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+ /* wait queue for pidfd notifications */
+ wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
};
diff --git a/kernel/fork.c b/kernel/fork.c
index 4249f60..e3a4a14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1481,8 +1481,34 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
}
#endif
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+ struct task_struct *task;
+ struct pid *pid = file->private_data;
+ int poll_flags = 0;
+
+ poll_wait(file, &pid->wait_pidfd, pts);
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ /*
+ * Inform pollers only when the whole thread group exits.
+ * If the thread group leader exits before all other threads in the
+ * group, then poll(2) should block, similar to the wait(2) family.
+ */
+ if (!task || (task->exit_state && thread_group_empty(task)))
+ poll_flags = POLLIN | POLLRDNORM;
+ rcu_read_unlock();
+
+ return poll_flags;
+}
+
const struct file_operations pidfd_fops = {
.release = pidfd_release,
+ .poll = pidfd_poll,
#ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
#endif
diff --git a/kernel/pid.c b/kernel/pid.c
index fa704f8..e605398 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -333,6 +333,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
+ init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index bedca16..053de87a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1632,6 +1632,14 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
return ret;
}
+static void do_notify_pidfd(struct task_struct *task)
+{
+ struct pid *pid;
+
+ pid = task_pid(task);
+ wake_up_all(&pid->wait_pidfd);
+}
+
/*
* Let a parent know about the death of a child.
* For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1655,6 +1663,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
BUG_ON(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));
+ /* Wake up all pidfd waiters */
+ do_notify_pidfd(tsk);
+
if (sig != SIGCHLD) {
/*
* This is only possible if parent == real_parent.
--
1.8.3.1
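From userspace, the effect of the patch above is that a pidfd becomes readable (POLLIN | POLLRDNORM) once the whole thread group has exited, so it can be waited on with poll(2) like any other descriptor. A minimal sketch of such a wait, assuming the caller already holds a suitable fd; wait_readable() is a hypothetical helper name, and obtaining the pidfd itself is deliberately left out:

```c
#include <errno.h>
#include <poll.h>

/*
 * Wait until fd reports POLLIN or timeout_ms elapses.
 * Returns 1 if POLLIN is set, 0 on timeout (or if only other events
 * were reported), and -1 on error with errno set by poll().
 * Note: on EINTR the wait is restarted with the full timeout.
 */
int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;
	return (pfd.revents & POLLIN) ? 1 : 0;
}
```

Because the helper only uses poll(2), the same pollfd array could also carry other descriptors, which is the "do other things while waiting" use case described in the commit message.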
From: Christian Brauner <christian(a)brauner.io>
[ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]
This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call. Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call. As
spotted by Linus, there is exactly one bit for clone() left.
CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount API. They serve as a simple opaque handle on pids. Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending). Thus, a
pidfd can refer not just to a tgid, but also to a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege. This does not imply it
cannot ever be that way but for now this is not the case.
A pidfd comes with additional information in fdinfo if the kernel supports
procfs. The fdinfo file contains the pid of the process in the caller's
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone. This has the advantage that we can
give back the associated pid and the pidfd at the same time.
To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>. The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.
Suggested-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Christian Brauner <christian(a)brauner.io>
Co-developed-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Jann Horn <jannh(a)google.com>
Reviewed-by: Oleg Nesterov <oleg(a)redhat.com>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: "Eric W. Biederman" <ebiederm(a)xmission.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: David Howells <dhowells(a)redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages(a)gmail.com>
Cc: Andy Lutomirsky <luto(a)kernel.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: <stable(a)vger.kernel.org> # 4.9.x
(clone: fix up cherry-pick conflicts for b3e583825266)
Signed-off-by: Wen Yang <wenyang(a)linux.alibaba.com>
---
include/linux/pid.h | 1 +
include/uapi/linux/sched.h | 1 +
kernel/fork.c | 105 +++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 103 insertions(+), 4 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 97b745d..7599a78 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -73,6 +73,7 @@ struct pid_link
struct hlist_node node;
struct pid *pid;
};
+extern const struct file_operations pidfd_fops;
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..ed6e31d 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -9,6 +9,7 @@
#define CLONE_FS 0x00000200 /* set if fs info shared between processes */
#define CLONE_FILES 0x00000400 /* set if open files shared between processes */
#define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD 0x00001000 /* set if a pidfd should be placed in parent */
#define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */
#define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */
#define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index b64efec..4249f60 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
* management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
*/
+#include <linux/anon_inodes.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/unistd.h>
@@ -1460,6 +1461,58 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
task->pids[type].pid = pid;
}
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+ struct pid *pid = file->private_data;
+
+ file->private_data = NULL;
+ put_pid(pid);
+ return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+ struct pid_namespace *ns = file_inode(m->file)->i_sb->s_fs_info;
+ struct pid *pid = f->private_data;
+
+ seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+ seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+ .release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid: struct pid that the pidfd will reference
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * been unshared to avoid leaking the pidfd to the new process.
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid)
+{
+ int fd;
+
+ fd = anon_inode_getfd("[pidfd]", &pidfd_fops, get_pid(pid),
+ O_RDWR | O_CLOEXEC);
+ if (fd < 0)
+ put_pid(pid);
+
+ return fd;
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -1472,13 +1525,14 @@ static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
+ int __user *parent_tidptr,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
- int retval;
+ int pidfd = -1, retval;
struct task_struct *p;
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
@@ -1526,6 +1580,30 @@ static __latent_entropy struct task_struct *copy_process(
retval = security_task_create(clone_flags);
if (retval)
goto fork_out;
+ if (clone_flags & CLONE_PIDFD) {
+ int reserved;
+
+ /*
+ * - CLONE_PARENT_SETTID is useless for pidfds and also
+ * parent_tidptr is used to return pidfds.
+ * - CLONE_DETACHED is blocked so that we can potentially
+ * reuse it later for CLONE_PIDFD.
+ * - CLONE_THREAD is blocked until someone really needs it.
+ */
+ if (clone_flags &
+ (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * Verify that parent_tidptr is sane so we can potentially
+ * reuse it later.
+ */
+ if (get_user(reserved, parent_tidptr))
+ return ERR_PTR(-EFAULT);
+
+ if (reserved != 0)
+ return ERR_PTR(-EINVAL);
+ }
retval = -ENOMEM;
p = dup_task_struct(current, node);
@@ -1703,6 +1781,22 @@ static __latent_entropy struct task_struct *copy_process(
}
}
+ /*
+ * This has to happen after we've potentially unshared the file
+ * descriptor table (so that the pidfd doesn't leak into the child
+ * if the fd table isn't shared).
+ */
+ if (clone_flags & CLONE_PIDFD) {
+ retval = pidfd_create(pid);
+ if (retval < 0)
+ goto bad_fork_free_pid;
+
+ pidfd = retval;
+ retval = put_user(pidfd, parent_tidptr);
+ if (retval)
+ goto bad_fork_put_pidfd;
+ }
+
#ifdef CONFIG_BLOCK
p->plug = NULL;
#endif
@@ -1758,7 +1852,7 @@ static __latent_entropy struct task_struct *copy_process(
*/
retval = cgroup_can_fork(p);
if (retval)
- goto bad_fork_free_pid;
+ goto bad_fork_put_pidfd;
/*
* From this point on we must avoid any synchronous user-space
@@ -1869,6 +1963,9 @@ static __latent_entropy struct task_struct *copy_process(
 spin_unlock(&current->sighand->siglock);
write_unlock_irq(&tasklist_lock);
cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+ if (clone_flags & CLONE_PIDFD)
+ __close_fd(current->files, pidfd);
bad_fork_free_pid:
threadgroup_change_end(current);
if (pid != &init_struct_pid)
@@ -1928,7 +2025,7 @@ static inline void init_idle_pids(struct pid_link *links)
struct task_struct *fork_idle(int cpu)
{
struct task_struct *task;
- task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
+ task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
cpu_to_node(cpu));
if (!IS_ERR(task)) {
init_idle_pids(task->pids);
@@ -1973,7 +2070,7 @@ long _do_fork(unsigned long clone_flags,
trace = 0;
}
- p = copy_process(clone_flags, stack_start, stack_size,
+ p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
add_latent_entropy();
/*
--
1.8.3.1
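The argument checks that the patch above adds to copy_process() for CLONE_PIDFD can be summarized in a small userspace model. This is only a sketch of the validation logic as it appears in the diff: the MODEL_ prefix and model_check_clone_pidfd() are hypothetical names, and the `reserved` parameter stands for the value read from parent_tidptr with get_user().

```c
#include <errno.h>

/* Flag values as in include/uapi/linux/sched.h; MODEL_ prefix avoids clashes. */
#define MODEL_CLONE_PIDFD          0x00001000UL
#define MODEL_CLONE_THREAD         0x00010000UL
#define MODEL_CLONE_PARENT_SETTID  0x00100000UL
#define MODEL_CLONE_DETACHED       0x00400000UL

/*
 * Returns 0 if the flag combination is acceptable, -EINVAL otherwise.
 * 'reserved' models the int currently stored at parent_tidptr, which
 * the patch requires to be zero when CLONE_PIDFD is requested.
 */
int model_check_clone_pidfd(unsigned long clone_flags, int reserved)
{
	if (!(clone_flags & MODEL_CLONE_PIDFD))
		return 0;

	/* These flags are rejected together with CLONE_PIDFD. */
	if (clone_flags & (MODEL_CLONE_DETACHED |
			   MODEL_CLONE_PARENT_SETTID |
			   MODEL_CLONE_THREAD))
		return -EINVAL;

	/* The slot that will receive the pidfd must start out as zero. */
	if (reserved != 0)
		return -EINVAL;

	return 0;
}
```

The -EFAULT case (an unreadable parent_tidptr) is not modelled here, since it depends on a real user-space pointer.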
This is the start of the stable review cycle for the 5.10.3 release.
There are 40 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Fri, 25 Dec 2020 15:05:02 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.3-rc1…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.10.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.10.3-rc1
Dae R. Jeong <dae.r.jeong(a)kaist.ac.kr>
md: fix a warning caused by a race between concurrent md_ioctl()s
Anant Thazhemadam <anant.thazhemadam(a)gmail.com>
nl80211: validate key indexes for cfg80211_registered_device
Eric Biggers <ebiggers(a)google.com>
crypto: af_alg - avoid undefined behavior accessing salg_name
Antti Palosaari <crope(a)iki.fi>
media: msi2500: assign SPI bus number dynamically
Anant Thazhemadam <anant.thazhemadam(a)gmail.com>
fs: quota: fix array-index-out-of-bounds bug by passing correct argument to vfs_cleanup_quota_inode()
Jan Kara <jack(a)suse.cz>
quota: Sanity-check quota file headers on load
Peilin Ye <yepeilin.cs(a)gmail.com>
Bluetooth: Fix slab-out-of-bounds read in hci_le_direct_adv_report_evt()
Eric Biggers <ebiggers(a)google.com>
f2fs: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
ext4: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
ubifs: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
fscrypt: add fscrypt_is_nokey_name()
Eric Biggers <ebiggers(a)google.com>
fscrypt: remove kernel-internal constants from UAPI header
Alexey Kardashevskiy <aik(a)ozlabs.ru>
serial_core: Check for port state when tty is in error state
Julian Sax <jsbc(a)gmx.de>
HID: i2c-hid: add Vero K147 to descriptor override
Arnd Bergmann <arnd(a)arndb.de>
scsi: megaraid_sas: Check user-provided offsets
Jack Qiu <jack.qiu(a)huawei.com>
f2fs: init dirty_secmap incorrectly
Chao Yu <chao(a)kernel.org>
f2fs: fix to seek incorrect data offset in inline data file
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: etm4x: Handle TRCVIPCSSCTLR accesses
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: etm4x: Fix accesses to TRCPROCSELR
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: etm4x: Fix accesses to TRCCIDCTLR1
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: etm4x: Fix accesses to TRCVMIDCTLR1
Sai Prakash Ranjan <saiprakash.ranjan(a)codeaurora.org>
coresight: etm4x: Skip setting LPOVERRIDE bit for qcom, skip-power-up
Sai Prakash Ranjan <saiprakash.ranjan(a)codeaurora.org>
coresight: etb10: Fix possible NULL ptr dereference in etb_enable_perf()
Suzuki K Poulose <suzuki.poulose(a)arm.com>
coresight: tmc-etr: Fix barrier packet insertion for perf buffer
Mao Jinlong <jinlmao(a)codeaurora.org>
coresight: tmc-etr: Check if page is valid before dma_map_page()
Sai Prakash Ranjan <saiprakash.ranjan(a)codeaurora.org>
coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
Krzysztof Kozlowski <krzk(a)kernel.org>
ARM: dts: exynos: fix USB 3.0 pins supply being turned off on Odroid XU
Krzysztof Kozlowski <krzk(a)kernel.org>
ARM: dts: exynos: fix USB 3.0 VBUS control and over-current pins on Exynos5410
Krzysztof Kozlowski <krzk(a)kernel.org>
ARM: dts: exynos: fix roles of USB 3.0 ports on Odroid XU
Fabio Estevam <festevam(a)gmail.com>
usb: chipidea: ci_hdrc_imx: Pass DISABLE_DEVICE_STREAMING flag to imx6ul
Will McVicker <willmcvicker(a)google.com>
USB: gadget: f_rndis: fix bitrate for SuperSpeed and above
Jack Pham <jackp(a)codeaurora.org>
usb: gadget: f_fs: Re-use SS descriptors for SuperSpeedPlus
Will McVicker <willmcvicker(a)google.com>
USB: gadget: f_midi: setup SuperSpeed Plus descriptors
taehyun.cho <taehyun.cho(a)samsung.com>
USB: gadget: f_acm: add support for SuperSpeed Plus
Johan Hovold <johan(a)kernel.org>
USB: serial: option: add interface-number sanity check to flag handling
Dan Carpenter <dan.carpenter(a)oracle.com>
usb: mtu3: fix memory corruption in mtu3_debugfs_regset()
Nicolin Chen <nicoleotsuka(a)gmail.com>
soc/tegra: fuse: Fix index bug in get_process_id
Artem Labazov <123321artyom(a)gmail.com>
exfat: Avoid allocating upcase table using kcalloc()
Andi Kleen <ak(a)linux.intel.com>
x86/split-lock: Avoid returning with interrupts enabled
Thierry Reding <treding(a)nvidia.com>
net: ipconfig: Avoid spurious blank lines in boot log
-------------
Diffstat:
Makefile | 4 +-
arch/arm/boot/dts/exynos5410-odroidxu.dts | 6 ++-
arch/arm/boot/dts/exynos5410-pinctrl.dtsi | 28 ++++++++++++
arch/arm/boot/dts/exynos5410.dtsi | 4 ++
arch/x86/kernel/traps.c | 3 +-
crypto/af_alg.c | 10 +++--
drivers/hid/i2c-hid/i2c-hid-dmi-quirks.c | 8 ++++
drivers/hwtracing/coresight/coresight-etb10.c | 4 +-
drivers/hwtracing/coresight/coresight-etm4x-core.c | 41 ++++++++++-------
drivers/hwtracing/coresight/coresight-priv.h | 2 +
drivers/hwtracing/coresight/coresight-tmc-etf.c | 4 +-
drivers/hwtracing/coresight/coresight-tmc-etr.c | 4 +-
drivers/md/md.c | 7 ++-
drivers/media/usb/msi2500/msi2500.c | 2 +-
drivers/scsi/megaraid/megaraid_sas_base.c | 16 ++++---
drivers/soc/tegra/fuse/speedo-tegra210.c | 2 +-
drivers/tty/serial/serial_core.c | 4 ++
drivers/usb/chipidea/ci_hdrc_imx.c | 3 +-
drivers/usb/gadget/function/f_acm.c | 2 +-
drivers/usb/gadget/function/f_fs.c | 5 ++-
drivers/usb/gadget/function/f_midi.c | 6 +++
drivers/usb/gadget/function/f_rndis.c | 4 +-
drivers/usb/mtu3/mtu3_debugfs.c | 2 +-
drivers/usb/serial/option.c | 23 +++++++++-
fs/crypto/fscrypt_private.h | 9 ++--
fs/crypto/hooks.c | 5 ++-
fs/crypto/keyring.c | 2 +-
fs/crypto/keysetup.c | 4 +-
fs/crypto/policy.c | 5 ++-
fs/exfat/nls.c | 6 +--
fs/ext4/namei.c | 3 ++
fs/f2fs/f2fs.h | 2 +
fs/f2fs/file.c | 11 +++--
fs/f2fs/segment.c | 2 +-
fs/quota/dquot.c | 2 +-
fs/quota/quota_v2.c | 19 ++++++++
fs/ubifs/dir.c | 17 ++++++--
include/linux/fscrypt.h | 34 +++++++++++++++
include/uapi/linux/fscrypt.h | 5 +--
include/uapi/linux/if_alg.h | 16 +++++++
net/bluetooth/hci_event.c | 12 +++--
net/ipv4/ipconfig.c | 14 +++---
net/wireless/core.h | 2 +
net/wireless/nl80211.c | 7 +--
net/wireless/util.c | 51 ++++++++++++++++++----
45 files changed, 334 insertions(+), 88 deletions(-)
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89deb1334252ea4a8491d47654811e28b0790364 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:37 +0100
Subject: [PATCH] iio:magnetometer:mag3110: Fix alignment and data leak issues.
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp() assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti-pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable structure in the iio_priv() data.
This data is allocated with kzalloc() so no data can leak apart from
previous readings.
The explicit alignment of ts is not necessary in this case but
does make the code slightly less fragile so I have included it.
Fixes: 39631b5f9584 ("iio: Add Freescale mag3110 magnetometer driver")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-4-jic23@kernel.org
diff --git a/drivers/iio/magnetometer/mag3110.c b/drivers/iio/magnetometer/mag3110.c
index 838b13c8bb3d..c96415a1aead 100644
--- a/drivers/iio/magnetometer/mag3110.c
+++ b/drivers/iio/magnetometer/mag3110.c
@@ -56,6 +56,12 @@ struct mag3110_data {
int sleep_val;
struct regulator *vdd_reg;
struct regulator *vddio_reg;
+ /* Ensure natural alignment of timestamp */
+ struct {
+ __be16 channels[3];
+ u8 temperature;
+ s64 ts __aligned(8);
+ } scan;
};
static int mag3110_request(struct mag3110_data *data)
@@ -387,10 +393,9 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct mag3110_data *data = iio_priv(indio_dev);
- u8 buffer[16]; /* 3 16-bit channels + 1 byte temp + padding + ts */
int ret;
- ret = mag3110_read(data, (__be16 *) buffer);
+ ret = mag3110_read(data, data->scan.channels);
if (ret < 0)
goto done;
@@ -399,10 +404,10 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
MAG3110_DIE_TEMP);
if (ret < 0)
goto done;
- buffer[6] = ret;
+ data->scan.temperature = ret;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buffer,
+ iio_push_to_buffers_with_timestamp(indio_dev, &data->scan,
iio_get_time_ns(indio_dev));
done:
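As a rough userspace illustration of the layout the mag3110 patch above introduces, the sketch below mirrors the scan structure with standard C11 types and exposes where the timestamp lands. The names struct model_scan, model_scan_ts_offset() and model_scan_size() are hypothetical, and _Alignas(8) is used here as a stand-in for the kernel's __aligned(8).

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the scan layout: 3 channels, 1 temp byte, timestamp. */
struct model_scan {
	uint16_t channels[3];   /* bytes 0..5 */
	uint8_t temperature;    /* byte 6, followed by one byte of padding */
	_Alignas(8) int64_t ts; /* explicitly aligned, so it starts at byte 8 */
};

/* Offset of the timestamp inside the scan structure. */
size_t model_scan_ts_offset(void)
{
	return offsetof(struct model_scan, ts);
}

/* Total size of one scan, including padding. */
size_t model_scan_size(void)
{
	return sizeof(struct model_scan);
}
```

With this layout the timestamp sits at offset 8 and the whole scan occupies 16 bytes, the same size as the on-stack u8 buffer[16] that the patch removes.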
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7b6b51234df6cd8b04fe736b0b89c25612d896b8 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:39 +0100
Subject: [PATCH] iio:imu:bmi160: Fix alignment and data leak issues
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti-pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable array in the iio_priv() data with alignment
explicitly requested. This data is allocated with kzalloc() so no
data can leak apart from previous readings.
In this driver, depending on which channels are enabled, the timestamp
can be in a number of locations. Hence we cannot use a structure
to specify the data layout without it being misleading.
Fixes: 77c4ad2d6a9b ("iio: imu: Add initial support for Bosch BMI160")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: Daniel Baluta <daniel.baluta(a)gmail.com>
Cc: Daniel Baluta <daniel.baluta(a)oss.nxp.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-6-jic23@kernel.org
diff --git a/drivers/iio/imu/bmi160/bmi160.h b/drivers/iio/imu/bmi160/bmi160.h
index a82e040bd109..32c2ea2d7112 100644
--- a/drivers/iio/imu/bmi160/bmi160.h
+++ b/drivers/iio/imu/bmi160/bmi160.h
@@ -10,6 +10,13 @@ struct bmi160_data {
struct iio_trigger *trig;
struct regulator_bulk_data supplies[2];
struct iio_mount_matrix orientation;
+ /*
+ * Ensure natural alignment for timestamp if present.
+ * Max length needed: 2 * 3 channels + 4 bytes padding + 8 byte ts.
+ * If fewer channels are enabled, less space may be needed, as
+ * long as the timestamp is still aligned to 8 bytes.
+ */
+ __le16 buf[12] __aligned(8);
};
extern const struct regmap_config bmi160_regmap_config;
diff --git a/drivers/iio/imu/bmi160/bmi160_core.c b/drivers/iio/imu/bmi160/bmi160_core.c
index c8e131c29043..290b5ef83f77 100644
--- a/drivers/iio/imu/bmi160/bmi160_core.c
+++ b/drivers/iio/imu/bmi160/bmi160_core.c
@@ -427,8 +427,6 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct bmi160_data *data = iio_priv(indio_dev);
- __le16 buf[12];
- /* 2 sens x 3 axis x __le16 + 2 x __le16 pad + 4 x __le16 tstamp */
int i, ret, j = 0, base = BMI160_REG_DATA_MAGN_XOUT_L;
__le16 sample;
@@ -438,10 +436,10 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
&sample, sizeof(sample));
if (ret)
goto done;
- buf[j++] = sample;
+ data->buf[j++] = sample;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buf, pf->timestamp);
+ iio_push_to_buffers_with_timestamp(indio_dev, data->buf, pf->timestamp);
done:
iio_trigger_notify_done(indio_dev->trig);
return IRQ_HANDLED;
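The bmi160 commit message above notes that the timestamp position depends on which channels are enabled. A minimal sketch of that arithmetic, assuming 16-bit channels and an 8-byte timestamp placed at the next 8-byte boundary after the channel data; model_ts_offset() and model_scan_bytes() are hypothetical helper names, not IIO core functions:

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_TS_SIZE sizeof(int64_t)

/* Byte offset of the timestamp: channel data rounded up to 8 bytes. */
size_t model_ts_offset(size_t n_channels)
{
	size_t data = n_channels * sizeof(uint16_t);

	return (data + MODEL_TS_SIZE - 1) / MODEL_TS_SIZE * MODEL_TS_SIZE;
}

/* Total bytes needed for one scan: padded channel data plus timestamp. */
size_t model_scan_bytes(size_t n_channels)
{
	return model_ts_offset(n_channels) + MODEL_TS_SIZE;
}
```

For the maximum of six channels this gives 12 bytes of data, 4 bytes of padding and an 8-byte timestamp, i.e. 24 bytes, which matches the __le16 buf[12] in the patch. With four or fewer channels the timestamp moves down to offset 8, which is why a fixed struct layout would be misleading here.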
This is a note to let you know that I've just added the patch titled
USB: serial: iuu_phoenix: fix DMA from stack
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 54d0a3ab80f49f19ee916def62fe067596833403 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan(a)kernel.org>
Date: Mon, 4 Jan 2021 15:50:07 +0100
Subject: USB: serial: iuu_phoenix: fix DMA from stack
Stack-allocated buffers cannot be used for DMA (on all architectures) so
allocate the flush command buffer using kmalloc().
Fixes: 60a8fc017103 ("USB: add iuu_phoenix driver")
Cc: stable <stable(a)vger.kernel.org> # 2.6.25
Reviewed-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/usb/serial/iuu_phoenix.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/usb/serial/iuu_phoenix.c b/drivers/usb/serial/iuu_phoenix.c
index f1201d4de297..e8f06b41a503 100644
--- a/drivers/usb/serial/iuu_phoenix.c
+++ b/drivers/usb/serial/iuu_phoenix.c
@@ -532,23 +532,29 @@ static int iuu_uart_flush(struct usb_serial_port *port)
struct device *dev = &port->dev;
int i;
int status;
- u8 rxcmd = IUU_UART_RX;
+ u8 *rxcmd;
struct iuu_private *priv = usb_get_serial_port_data(port);
if (iuu_led(port, 0xF000, 0, 0, 0xFF) < 0)
return -EIO;
+ rxcmd = kmalloc(1, GFP_KERNEL);
+ if (!rxcmd)
+ return -ENOMEM;
+
+ rxcmd[0] = IUU_UART_RX;
+
for (i = 0; i < 2; i++) {
- status = bulk_immediate(port, &rxcmd, 1);
+ status = bulk_immediate(port, rxcmd, 1);
if (status != IUU_OPERATION_OK) {
dev_dbg(dev, "%s - uart_flush_write error\n", __func__);
- return status;
+ goto out_free;
}
status = read_immediate(port, &priv->len, 1);
if (status != IUU_OPERATION_OK) {
dev_dbg(dev, "%s - uart_flush_read error\n", __func__);
- return status;
+ goto out_free;
}
if (priv->len > 0) {
@@ -556,12 +562,16 @@ static int iuu_uart_flush(struct usb_serial_port *port)
status = read_immediate(port, priv->buf, priv->len);
if (status != IUU_OPERATION_OK) {
dev_dbg(dev, "%s - uart_flush_read error\n", __func__);
- return status;
+ goto out_free;
}
}
}
dev_dbg(dev, "%s - uart_flush_read OK!\n", __func__);
iuu_led(port, 0, 0xF000, 0, 0xFF);
+
+out_free:
+ kfree(rxcmd);
+
return status;
}
--
2.30.0
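The control flow of the iuu_phoenix fix above (heap-allocate the command byte, then route every exit through a single label that frees it) can be sketched in userspace as follows. This is an illustrative model only: malloc()/free() stand in for kmalloc()/kfree(), the transfer is a hypothetical callback rather than bulk_immediate(), and MODEL_CMD_RX is a placeholder value, not the driver's IUU_UART_RX constant.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_CMD_RX 0x01 /* placeholder value, not the driver's constant */

/* Hypothetical transfer callback: returns 0 on success, negative errno on error. */
typedef int (*model_xfer_fn)(const unsigned char *buf, size_t len);

/* Example callback that accepts the one-byte command. */
int model_xfer_ok(const unsigned char *buf, size_t len)
{
	return (len == 1 && buf[0] == MODEL_CMD_RX) ? 0 : -EIO;
}

/* Example callback that always fails. */
int model_xfer_fail(const unsigned char *buf, size_t len)
{
	(void)buf;
	(void)len;
	return -EIO;
}

/* Send the command twice; the buffer is freed on every path after allocation. */
int model_flush(model_xfer_fn xfer)
{
	unsigned char *cmd;
	int status = 0;
	int i;

	cmd = malloc(1);
	if (!cmd)
		return -ENOMEM;
	cmd[0] = MODEL_CMD_RX;

	for (i = 0; i < 2; i++) {
		status = xfer(cmd, 1);
		if (status != 0)
			goto out_free;
	}

out_free:
	free(cmd);
	return status;
}
```

The success path falls through into out_free as well, so there is exactly one place where the buffer is released.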
Hi Greg
Please consider adding
commit 71ac13457d9d1007effde65b54818106b2c2b525
Author: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
Date: Fri Dec 18 11:10:54 2020 +0100
rtc: pcf2127: only use watchdog when explicitly available
to the 5.10 stable queue. You will need the preparatory refactoring patch
commit 5d78533a0c53af9659227c803df944ba27cd56e0
Author: Uwe Kleine-König <u.kleine-koenig(a)pengutronix.de>
Date: Thu Sep 24 12:52:55 2020 +0200
rtc: pcf2127: move watchdog initialisation to a separate function
And if documentation is supposed to be kept up-to-date in the -stable
trees, you can pick
commit 320d159e2d63a97a40f24cd6dfda5a57eec65b91
Author: Rasmus Villemoes <rasmus.villemoes(a)prevas.dk>
Date: Fri Dec 18 11:10:53 2020 +0100
dt-bindings: rtc: add reset-source property
Thanks,
Rasmus
This is the start of the stable review cycle for the 4.19.165 release.
There are 29 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 07 Jan 2021 09:08:03 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.165-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 4.19.165-rc2
Hyeongseok Kim <hyeongseok(a)gmail.com>
dm verity: skip verity work if I/O error when system is shutting down
Takashi Iwai <tiwai(a)suse.de>
ALSA: pcm: Clear the full allocated memory at hw_params
Jessica Yu <jeyu(a)kernel.org>
module: delay kobject uevent until after module init call
Trond Myklebust <trond.myklebust(a)hammerspace.com>
NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode
Qinglang Miao <miaoqinglang(a)huawei.com>
powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
Jan Kara <jack(a)suse.cz>
quota: Don't overflow quota file offsets
Miroslav Benes <mbenes(a)suse.cz>
module: set MODULE_STATE_GOING state when a module fails to load
Dinghao Liu <dinghao.liu(a)zju.edu.cn>
rtc: sun6i: Fix memleak in sun6i_rtc_clk_init
Boqun Feng <boqun.feng(a)gmail.com>
fcntl: Fix potential deadlock in send_sig{io, urg}()
Takashi Iwai <tiwai(a)suse.de>
ALSA: rawmidi: Access runtime->avail always in spinlock
Takashi Iwai <tiwai(a)suse.de>
ALSA: seq: Use bool for snd_seq_queue internal flags
Mauro Carvalho Chehab <mchehab+huawei(a)kernel.org>
media: gp8psk: initialize stats at power control logic
Anant Thazhemadam <anant.thazhemadam(a)gmail.com>
misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells()
Rustam Kovhaev <rkovhaev(a)gmail.com>
reiserfs: add check for an invalid ih_entry_count
Anant Thazhemadam <anant.thazhemadam(a)gmail.com>
Bluetooth: hci_h5: close serdev device and free hu in h5_close
Johan Hovold <johan(a)kernel.org>
of: fix linker-section match-table corruption
Damien Le Moal <damien.lemoal(a)wdc.com>
null_blk: Fix zone size initialization
Souptick Joarder <jrdr.linux(a)gmail.com>
xen/gntdev.c: Mark pages as dirty
Christophe Leroy <christophe.leroy(a)csgroup.eu>
powerpc/bitops: Fix possible undefined behaviour with fls() and fls64()
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: SVM: relax conditions for allowing MSR_IA32_SPEC_CTRL accesses
Petr Vorel <petr.vorel(a)gmail.com>
uapi: move constants from <linux/kernel.h> to <linux/const.h>
Jan Kara <jack(a)suse.cz>
ext4: don't remount read-only with errors=continue on reboot
Eric Auger <eric.auger(a)redhat.com>
vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
Eric Biggers <ebiggers(a)google.com>
ubifs: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
f2fs: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
ext4: prevent creating duplicate encrypted filenames
Eric Biggers <ebiggers(a)google.com>
fscrypt: add fscrypt_is_nokey_name()
Kevin Vigor <kvigor(a)gmail.com>
md/raid10: initialize r10_bio->read_slot before use.
-------------
Diffstat:
Makefile | 4 +--
arch/powerpc/include/asm/bitops.h | 23 ++++++++++++++--
arch/powerpc/sysdev/mpic_msgr.c | 2 +-
arch/x86/kvm/cpuid.h | 14 ++++++++++
arch/x86/kvm/svm.c | 9 ++----
arch/x86/kvm/vmx.c | 6 ++--
drivers/block/null_blk_zoned.c | 20 +++++++++-----
drivers/bluetooth/hci_h5.c | 8 ++++--
drivers/md/dm-verity-target.c | 12 +++++++-
drivers/md/raid10.c | 3 +-
drivers/media/usb/dvb-usb/gp8psk.c | 2 +-
drivers/misc/vmw_vmci/vmci_context.c | 2 +-
drivers/rtc/rtc-sun6i.c | 8 ++++--
drivers/vfio/pci/vfio_pci.c | 3 +-
drivers/xen/gntdev.c | 17 ++++++++----
fs/crypto/hooks.c | 10 +++----
fs/ext4/namei.c | 3 ++
fs/ext4/super.c | 14 ++++------
fs/f2fs/f2fs.h | 2 ++
fs/fcntl.c | 10 ++++---
fs/nfs/nfs4super.c | 2 +-
fs/nfs/pnfs.c | 33 ++++++++++++++++++++--
fs/nfs/pnfs.h | 5 ++++
fs/quota/quota_tree.c | 8 +++---
fs/reiserfs/stree.c | 6 ++++
fs/ubifs/dir.c | 17 +++++++++---
include/linux/fscrypt_notsupp.h | 5 ++++
include/linux/fscrypt_supp.h | 29 +++++++++++++++++++
include/linux/of.h | 1 +
include/uapi/linux/const.h | 5 ++++
include/uapi/linux/ethtool.h | 2 +-
include/uapi/linux/kernel.h | 9 +-----
include/uapi/linux/lightnvm.h | 2 +-
include/uapi/linux/mroute6.h | 2 +-
include/uapi/linux/netfilter/x_tables.h | 2 +-
include/uapi/linux/netlink.h | 2 +-
include/uapi/linux/sysctl.h | 2 +-
kernel/module.c | 6 ++--
sound/core/pcm_native.c | 9 ++++--
sound/core/rawmidi.c | 49 +++++++++++++++++++++++----------
sound/core/seq/seq_queue.h | 8 +++---
41 files changed, 275 insertions(+), 101 deletions(-)
Stack-allocated buffers cannot be used for DMA (on all architectures).
Replace the HP-channel macro with a helper function that allocates a
dedicated transfer buffer so that it can continue to be used with
arguments from the stack.
Note that the buffer is cleared on allocation as usblp_ctrl_msg()
returns success also on short transfers (the buffer is only used for
debugging).
Cc: stable(a)vger.kernel.org
Signed-off-by: Johan Hovold <johan(a)kernel.org>
---
drivers/usb/class/usblp.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/class/usblp.c b/drivers/usb/class/usblp.c
index 67cbd42421be..134dc2005ce9 100644
--- a/drivers/usb/class/usblp.c
+++ b/drivers/usb/class/usblp.c
@@ -274,8 +274,25 @@ static int usblp_ctrl_msg(struct usblp *usblp, int request, int type, int dir, i
#define usblp_reset(usblp)\
usblp_ctrl_msg(usblp, USBLP_REQ_RESET, USB_TYPE_CLASS, USB_DIR_OUT, USB_RECIP_OTHER, 0, NULL, 0)
-#define usblp_hp_channel_change_request(usblp, channel, buffer) \
- usblp_ctrl_msg(usblp, USBLP_REQ_HP_CHANNEL_CHANGE_REQUEST, USB_TYPE_VENDOR, USB_DIR_IN, USB_RECIP_INTERFACE, channel, buffer, 1)
+static int usblp_hp_channel_change_request(struct usblp *usblp, int channel, u8 *new_channel)
+{
+ u8 *buf;
+ int ret;
+
+ buf = kzalloc(1, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ ret = usblp_ctrl_msg(usblp, USBLP_REQ_HP_CHANNEL_CHANGE_REQUEST,
+ USB_TYPE_VENDOR, USB_DIR_IN, USB_RECIP_INTERFACE,
+ channel, buf, 1);
+ if (ret == 0)
+ *new_channel = buf[0];
+
+ kfree(buf);
+
+ return ret;
+}
/*
* See the description for usblp_select_alts() below for the usage
--
2.26.2
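The bounce-buffer pattern used by the usblp patch above can be sketched in plain userspace C. This is only an illustration under stated assumptions: calloc()/free() stand in for kzalloc()/kfree(), and the toy_xfer_fn callback is a hypothetical placeholder for usblp_ctrl_msg(). None of the toy_ names exist in the driver.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical transfer callback; returns 0 on success, negative errno on error. */
typedef int (*toy_xfer_fn)(void *ctx, uint8_t *buf, size_t len);

/*
 * Read one byte through a dedicated, zeroed heap buffer so that the
 * caller may keep passing a pointer to a stack variable in 'out'.
 */
static int toy_read_one_byte(toy_xfer_fn xfer, void *ctx, uint8_t *out)
{
	uint8_t *buf;
	int ret;

	buf = calloc(1, 1);	/* zeroed, like kzalloc() in the patch */
	if (!buf)
		return -ENOMEM;

	ret = xfer(ctx, buf, 1);
	if (ret == 0)
		*out = buf[0];	/* copy out only on success */

	free(buf);
	return ret;
}

/* Fake transfers used to exercise both paths. */
static int toy_xfer_ok(void *ctx, uint8_t *buf, size_t len)
{
	(void)ctx;
	if (len >= 1)
		buf[0] = 0x2a;
	return 0;
}

static int toy_xfer_fail(void *ctx, uint8_t *buf, size_t len)
{
	(void)ctx;
	(void)buf;
	(void)len;
	return -EIO;
}
```

On the failure path the caller's variable is left untouched, which mirrors the "if (ret == 0)" guard in the patch.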
It is possible to exit nested guest mode, entered by
svm_set_nested_state(), before the first VM entry into it (e.g. due to a
pending event) if the nested run was not pending during the migration.
In this case we must not switch to the nested msr permission bitmap.
Also add a warning to catch similar cases in the future.
CC: stable(a)vger.kernel.org
Fixes: a7d5c7ce41ac1 ("KVM: nSVM: delay MSR permission processing to first nested VM run")
Signed-off-by: Maxim Levitsky <mlevitsk(a)redhat.com>
---
arch/x86/kvm/svm/nested.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 18b71e73a9935..6208d3a5a3fdb 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -199,6 +199,10 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (WARN_ON_ONCE(!is_guest_mode(&svm->vcpu)))
+ return false;
+
if (!nested_svm_vmrun_msrpm(svm)) {
vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
vcpu->run->internal.suberror =
@@ -598,6 +602,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
svm->nested.vmcb12_gpa = 0;
WARN_ON_ONCE(svm->nested.nested_run_pending);
+ kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, &svm->vcpu);
+
/* in case we halted in L2 */
svm->vcpu.arch.mp_state = KVM_MP_STATE_RUNNABLE;
--
2.26.2
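The nSVM fix above can be pictured with a toy model of a deferred request flag. This is a simplified userspace sketch, not KVM code: the toy_ names, the single request bit and the warning counter are all hypothetical, and the real request handling in KVM is considerably more involved.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_REQ_GET_NESTED_STATE_PAGES (1u << 0)	/* hypothetical request bit */

struct toy_vcpu {
	bool guest_mode;	/* true while in nested guest mode */
	uint32_t requests;	/* pending deferred requests */
	int warnings;		/* counts WARN_ON_ONCE-style hits */
};

/* Restoring nested state enters guest mode and defers the MSR bitmap work. */
static void toy_set_nested_state(struct toy_vcpu *v)
{
	v->guest_mode = true;
	v->requests |= TOY_REQ_GET_NESTED_STATE_PAGES;
}

/* Leaving guest mode; 'clear_request' models the line added by the patch. */
static void toy_nested_vmexit(struct toy_vcpu *v, bool clear_request)
{
	v->guest_mode = false;
	if (clear_request)
		v->requests &= ~TOY_REQ_GET_NESTED_STATE_PAGES;
}

/* Returns false if the deferred work was attempted outside guest mode. */
static bool toy_process_requests(struct toy_vcpu *v)
{
	if (!(v->requests & TOY_REQ_GET_NESTED_STATE_PAGES))
		return true;	/* nothing pending */

	v->requests &= ~TOY_REQ_GET_NESTED_STATE_PAGES;
	if (!v->guest_mode) {
		v->warnings++;	/* stale request: must not switch bitmaps */
		return false;
	}
	return true;		/* would switch to the nested bitmap here */
}
```

Without the clearing step a stale request survives the early exit and trips the guard; with it, nothing is pending by the time requests are processed.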
The old sync_core_before_usermode() comments said that a non-icache-syncing
return-to-usermode instruction is x86-specific and that all other
architectures automatically notice cross-modified code on return to
userspace. Based on my general understanding of how CPUs work and based on
my attempt to read the ARM manual, this is not true at all. In fact, x86
seems to be a bit of an anomaly in the other direction: x86's IRET is
unusually heavyweight for a return-to-usermode instruction.
So let's drop any pretense that we can have a generic implementation
behind membarrier's SYNC_CORE flush and require all architectures that opt
in to supply their own. This means x86, arm64, and powerpc for now. Let's
also rename the function from sync_core_before_usermode() to
membarrier_sync_core_before_usermode() because the precise flushing details
may very well be specific to membarrier, and even the concept of
"sync_core" in the kernel is mostly an x86-ism.
I admit that I'm rather surprised that the code worked at all on arm64,
and I'm suspicious that it has never been very well tested. My apologies
for not reviewing this more carefully in the first place.
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: linuxppc-dev(a)lists.ozlabs.org
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: linux-arm-kernel(a)lists.infradead.org
Cc: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Cc: x86(a)kernel.org
Cc: stable(a)vger.kernel.org
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto(a)kernel.org>
---
Hi arm64 and powerpc people-
This is part of a series here:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/f…
Before I send out the whole series, I'm hoping that some arm64 and powerpc
people can help me verify that I did this patch right. Once I get
some feedback on this patch, I'll send out the whole pile. And once
*that's* done, I'll start giving the mm lazy stuff some serious thought.
The x86 part is already fixed in Linus' tree.
Thanks,
Andy
arch/arm64/include/asm/sync_core.h | 21 +++++++++++++++++++++
arch/powerpc/include/asm/sync_core.h | 20 ++++++++++++++++++++
arch/x86/Kconfig | 1 -
arch/x86/include/asm/sync_core.h | 7 +++----
include/linux/sched/mm.h | 1 -
include/linux/sync_core.h | 21 ---------------------
init/Kconfig | 3 ---
kernel/sched/membarrier.c | 15 +++++++++++----
8 files changed, 55 insertions(+), 34 deletions(-)
create mode 100644 arch/arm64/include/asm/sync_core.h
create mode 100644 arch/powerpc/include/asm/sync_core.h
delete mode 100644 include/linux/sync_core.h
diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
new file mode 100644
index 000000000000..5be4531caabd
--- /dev/null
+++ b/arch/arm64/include/asm/sync_core.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_SYNC_CORE_H
+#define _ASM_ARM64_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+ /*
+ * XXX: is this enough or do we need a DMB first to make sure that
+ * writes from other CPUs become visible to this CPU? We have an
+ * smp_mb() already, but that's not quite the same thing.
+ */
+ isb();
+}
+
+#endif /* _ASM_ARM64_SYNC_CORE_H */
diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
new file mode 100644
index 000000000000..71dfbe7794e5
--- /dev/null
+++ b/arch/powerpc/include/asm/sync_core.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SYNC_CORE_H
+#define _ASM_POWERPC_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+ /*
+ * XXX: I know basically nothing about powerpc cache management.
+ * Is this correct?
+ */
+ isync();
+}
+
+#endif /* _ASM_POWERPC_SYNC_CORE_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b5137cc5b7b4..895f70fd4a61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,7 +81,6 @@ config X86
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
- select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_DEBUG_WX
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..c665b453969a 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -89,11 +89,10 @@ static inline void sync_core(void)
}
/*
- * Ensure that a core serializing instruction is issued before returning
- * to user-mode. x86 implements return to user-space through sysexit,
- * sysrel, and sysretq, which are not core serializing.
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
*/
-static inline void sync_core_before_usermode(void)
+static inline void membarrier_sync_core_before_usermode(void)
{
/* With PTI, we unconditionally serialize before running user code. */
if (static_cpu_has(X86_FEATURE_PTI))
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 48640db6ca86..81ba47910a73 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,7 +7,6 @@
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/gfp.h>
-#include <linux/sync_core.h>
/*
* Routines for handling mm_structs
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
deleted file mode 100644
index 013da4b8b327..000000000000
--- a/include/linux/sync_core.h
+++ /dev/null
@@ -1,21 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SYNC_CORE_H
-#define _LINUX_SYNC_CORE_H
-
-#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-#include <asm/sync_core.h>
-#else
-/*
- * This is a dummy sync_core_before_usermode() implementation that can be used
- * on all architectures which return to user-space through core serializing
- * instructions.
- * If your architecture returns to user-space through non-core-serializing
- * instructions, you need to write your own functions.
- */
-static inline void sync_core_before_usermode(void)
-{
-}
-#endif
-
-#endif /* _LINUX_SYNC_CORE_H */
-
diff --git a/init/Kconfig b/init/Kconfig
index c9446911cf41..eb9772078cd4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2334,9 +2334,6 @@ source "kernel/Kconfig.locks"
config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
bool
-config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
- bool
-
# It may be useful for an architecture to override the definitions of the
# SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
# and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b3a82d7635da..db4945e1ec94 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -5,6 +5,9 @@
* membarrier system call
*/
#include "sched.h"
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#include <asm/sync_core.h>
+#endif
/*
* The basic principle behind the regular memory barrier mode of membarrier()
@@ -221,6 +224,7 @@ static void ipi_mb(void *info)
smp_mb(); /* IPIs should be serializing but paranoid. */
}
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static void ipi_sync_core(void *info)
{
/*
@@ -230,13 +234,14 @@ static void ipi_sync_core(void *info)
* the big comment at the top of this file.
*
* A sync_core() would provide this guarantee, but
- * sync_core_before_usermode() might end up being deferred until
- * after membarrier()'s smp_mb().
+ * membarrier_sync_core_before_usermode() might end up being deferred
+ * until after membarrier()'s smp_mb().
*/
smp_mb(); /* IPIs should be serializing but paranoid. */
- sync_core_before_usermode();
+ membarrier_sync_core_before_usermode();
}
+#endif
static void ipi_rseq(void *info)
{
@@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int cpu_id)
smp_call_func_t ipi_func = ipi_mb;
if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
- if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
return -EINVAL;
+#else
if (!(atomic_read(&mm->membarrier_state) &
MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
return -EPERM;
ipi_func = ipi_sync_core;
+#endif
} else if (flags == MEMBARRIER_FLAG_RSEQ) {
if (!IS_ENABLED(CONFIG_RSEQ))
return -EINVAL;
--
2.29.2
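The membarrier.c hunks above switch from an IS_ENABLED() test to #ifdef. One likely reason (my reading, not stated in the patch) is that ipi_sync_core() now exists only when the option is set, while code under an IS_ENABLED() test must still compile even when it is dead. A minimal userspace sketch of that gating follows; TOY_HAS_SYNC_CORE and the toy_ functions are hypothetical stand-ins for the kernel symbols.

```c
#include <errno.h>

/*
 * Define TOY_HAS_SYNC_CORE at build time to model an architecture that
 * supplies its own sync-core helper. Without it, the helper does not
 * exist at all, so it must never be referenced.
 */
#ifdef TOY_HAS_SYNC_CORE
static int toy_sync_core_calls;

static void toy_ipi_sync_core(void)
{
	toy_sync_core_calls++;
}
#endif

/* 'registered' models the SYNC_CORE_READY bit in the mm's membarrier state. */
static int toy_private_expedited_sync_core(int registered)
{
#ifndef TOY_HAS_SYNC_CORE
	(void)registered;
	return -EINVAL;		/* architecture did not opt in */
#else
	if (!registered)
		return -EPERM;	/* process never registered for the command */
	toy_ipi_sync_core();
	return 0;
#endif
}
```

Built without -DTOY_HAS_SYNC_CORE, every call returns -EINVAL and the helper is never compiled; built with it, the registration check and the helper call become live.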
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89deb1334252ea4a8491d47654811e28b0790364 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:37 +0100
Subject: [PATCH] iio:magnetometer:mag3110: Fix alignment and data leak issues.
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp() assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable structure in the iio_priv() data.
This data is allocated with kzalloc() so no data can leak apart from
previous readings.
The explicit alignment of ts is not necessary in this case but
does make the code slightly less fragile so I have included it.
Fixes: 39631b5f9584 ("iio: Add Freescale mag3110 magnetometer driver")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-4-jic23@kernel.org
diff --git a/drivers/iio/magnetometer/mag3110.c b/drivers/iio/magnetometer/mag3110.c
index 838b13c8bb3d..c96415a1aead 100644
--- a/drivers/iio/magnetometer/mag3110.c
+++ b/drivers/iio/magnetometer/mag3110.c
@@ -56,6 +56,12 @@ struct mag3110_data {
int sleep_val;
struct regulator *vdd_reg;
struct regulator *vddio_reg;
+ /* Ensure natural alignment of timestamp */
+ struct {
+ __be16 channels[3];
+ u8 temperature;
+ s64 ts __aligned(8);
+ } scan;
};
static int mag3110_request(struct mag3110_data *data)
@@ -387,10 +393,9 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct mag3110_data *data = iio_priv(indio_dev);
- u8 buffer[16]; /* 3 16-bit channels + 1 byte temp + padding + ts */
int ret;
- ret = mag3110_read(data, (__be16 *) buffer);
+ ret = mag3110_read(data, data->scan.channels);
if (ret < 0)
goto done;
@@ -399,10 +404,10 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
MAG3110_DIE_TEMP);
if (ret < 0)
goto done;
- buffer[6] = ret;
+ data->scan.temperature = ret;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buffer,
+ iio_push_to_buffers_with_timestamp(indio_dev, &data->scan,
iio_get_time_ns(indio_dev));
done:
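The layout that the mag3110 patch above relies on can be checked in userspace. This is a sketch under assumptions: uint16_t stands in for __be16, the GCC/Clang aligned attribute stands in for __aligned(8), calloc() stands in for the kzalloc()-backed iio_priv() area, and the toy_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the shape of the 'scan' member added by the patch. */
struct toy_scan {
	uint16_t channels[3];	/* bytes 0..5 */
	uint8_t temperature;	/* byte 6, followed by one byte of padding */
	int64_t ts __attribute__((aligned(8)));	/* starts at byte 8 */
};

/* Stand-in for the driver's private data, which embeds the scan buffer. */
struct toy_priv {
	int sleep_val;
	struct toy_scan scan;
};

/* Zeroed allocation, so padding never carries stale stack contents. */
static struct toy_priv *toy_priv_alloc(void)
{
	return calloc(1, sizeof(struct toy_priv));
}

/* Nonzero if the timestamp field sits on an 8-byte boundary. */
static int toy_ts_is_aligned(const struct toy_priv *p)
{
	return ((uintptr_t)&p->scan.ts % 8) == 0;
}
```

Because the buffer lives in zeroed heap memory rather than on the stack, the padding byte after temperature can only ever hold zero or data from an earlier reading.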
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 89deb1334252ea4a8491d47654811e28b0790364 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:37 +0100
Subject: [PATCH] iio:magnetometer:mag3110: Fix alignment and data leak issues.
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp() assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable structure in the iio_priv() data.
This data is allocated with kzalloc() so no data can leak apart from
previous readings.
The explicit alignment of ts is not necessary in this case but
does make the code slightly less fragile so I have included it.
Fixes: 39631b5f9584 ("iio: Add Freescale mag3110 magnetometer driver")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-4-jic23@kernel.org
diff --git a/drivers/iio/magnetometer/mag3110.c b/drivers/iio/magnetometer/mag3110.c
index 838b13c8bb3d..c96415a1aead 100644
--- a/drivers/iio/magnetometer/mag3110.c
+++ b/drivers/iio/magnetometer/mag3110.c
@@ -56,6 +56,12 @@ struct mag3110_data {
int sleep_val;
struct regulator *vdd_reg;
struct regulator *vddio_reg;
+ /* Ensure natural alignment of timestamp */
+ struct {
+ __be16 channels[3];
+ u8 temperature;
+ s64 ts __aligned(8);
+ } scan;
};
static int mag3110_request(struct mag3110_data *data)
@@ -387,10 +393,9 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct mag3110_data *data = iio_priv(indio_dev);
- u8 buffer[16]; /* 3 16-bit channels + 1 byte temp + padding + ts */
int ret;
- ret = mag3110_read(data, (__be16 *) buffer);
+ ret = mag3110_read(data, data->scan.channels);
if (ret < 0)
goto done;
@@ -399,10 +404,10 @@ static irqreturn_t mag3110_trigger_handler(int irq, void *p)
MAG3110_DIE_TEMP);
if (ret < 0)
goto done;
- buffer[6] = ret;
+ data->scan.temperature = ret;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buffer,
+ iio_push_to_buffers_with_timestamp(indio_dev, &data->scan,
iio_get_time_ns(indio_dev));
done:
With Paul, we've been thinking that the idle loop wasn't twisted enough
yet to deserve 2020.
rcutorture, after some recent parameter changes, has been complaining
about a hung task.
It appears that rcu_idle_enter() may wake up a NOCB kthread but this
happens after the last generic need_resched() check. Some cpuidle drivers
fix it by chance but many others don't.
Here is a proposed bunch of fixes. I will need to also fix the
rcu_user_enter() case, likely using irq_work, since nohz_full requires
irq work to support self IPI.
Also, more generally, this raises the question of local task wake_up()
under disabled interrupts. When a wake up occurs in a preempt disabled
section, it gets handled by the outer preempt_enable() call. There is no
similar mechanism when a wake up occurs with interrupts disabled. I guess
it is assumed to be handled, at worst, in the next tick. But a local irq
work would provide instant preemption once IRQs are re-enabled. Of course
this would only make sense in CONFIG_PREEMPTION, and when the tick is
disabled...
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
sched/idle
HEAD: f2fa6e4a070c1535b9edc9ee097167fd2b15d235
Thanks,
Frederic
---
Frederic Weisbecker (4):
sched/idle: Fix missing need_resched() check after rcu_idle_enter()
cpuidle: Fix missing need_resched() check after rcu_idle_enter()
ARM: imx6q: Fix missing need_resched() check after rcu_idle_enter()
ACPI: processor: Fix missing need_resched() check after rcu_idle_enter()
arch/arm/mach-imx/cpuidle-imx6q.c | 7 ++++++-
drivers/acpi/processor_idle.c | 10 ++++++++--
drivers/cpuidle/cpuidle.c | 33 +++++++++++++++++++++++++--------
kernel/sched/idle.c | 18 ++++++++++++------
4 files changed, 51 insertions(+), 17 deletions(-)
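The race described in the cover letter above can be modelled with a few lines of userspace C. This is a deliberately simplified sketch: the toy_ names are hypothetical, "NOCB work pending" is reduced to a single flag, and the real idle paths differ per cpuidle driver.

```c
#include <stdbool.h>

struct toy_cpu {
	bool need_resched;	/* set when a task was woken on this CPU */
	bool nocb_wake_pending;	/* entering RCU idle will wake a kthread */
};

/* Models rcu_idle_enter() waking a NOCB kthread as a side effect. */
static void toy_rcu_idle_enter(struct toy_cpu *c)
{
	if (c->nocb_wake_pending) {
		c->nocb_wake_pending = false;
		c->need_resched = true;
	}
}

/* Returns true if the CPU goes to sleep. Only checks before entering RCU idle. */
static bool toy_idle_buggy(struct toy_cpu *c)
{
	if (c->need_resched)
		return false;
	toy_rcu_idle_enter(c);
	return true;		/* may sleep with a wakeup pending */
}

/* Same loop with the extra check after toy_rcu_idle_enter(). */
static bool toy_idle_fixed(struct toy_cpu *c)
{
	if (c->need_resched)
		return false;
	toy_rcu_idle_enter(c);
	if (c->need_resched)
		return false;	/* back out instead of sleeping */
	return true;
}
```

In the buggy variant the CPU sleeps while need_resched is set, which is the shape of the hung task that rcutorture reported; the fixed variant notices the late wakeup and skips the sleep.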
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7b6b51234df6cd8b04fe736b0b89c25612d896b8 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:39 +0100
Subject: [PATCH] iio:imu:bmi160: Fix alignment and data leak issues
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable array in the iio_priv() data with alignment
explicitly requested. This data is allocated with kzalloc() so no
data can leak apart from previous readings.
In this driver, depending on which channels are enabled, the timestamp
can be in a number of locations. Hence we cannot use a structure
to specify the data layout without it being misleading.
Fixes: 77c4ad2d6a9b ("iio: imu: Add initial support for Bosch BMI160")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: Daniel Baluta <daniel.baluta(a)gmail.com>
Cc: Daniel Baluta <daniel.baluta(a)oss.nxp.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-6-jic23@kernel.org
diff --git a/drivers/iio/imu/bmi160/bmi160.h b/drivers/iio/imu/bmi160/bmi160.h
index a82e040bd109..32c2ea2d7112 100644
--- a/drivers/iio/imu/bmi160/bmi160.h
+++ b/drivers/iio/imu/bmi160/bmi160.h
@@ -10,6 +10,13 @@ struct bmi160_data {
struct iio_trigger *trig;
struct regulator_bulk_data supplies[2];
struct iio_mount_matrix orientation;
+ /*
+ * Ensure natural alignment for timestamp if present.
+ * Max length needed: 2 * 3 channels + 4 bytes padding + 8 byte ts.
+ * If fewer channels are enabled, less space may be needed, as
+ * long as the timestamp is still aligned to 8 bytes.
+ */
+ __le16 buf[12] __aligned(8);
};
extern const struct regmap_config bmi160_regmap_config;
diff --git a/drivers/iio/imu/bmi160/bmi160_core.c b/drivers/iio/imu/bmi160/bmi160_core.c
index c8e131c29043..290b5ef83f77 100644
--- a/drivers/iio/imu/bmi160/bmi160_core.c
+++ b/drivers/iio/imu/bmi160/bmi160_core.c
@@ -427,8 +427,6 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct bmi160_data *data = iio_priv(indio_dev);
- __le16 buf[12];
- /* 2 sens x 3 axis x __le16 + 2 x __le16 pad + 4 x __le16 tstamp */
int i, ret, j = 0, base = BMI160_REG_DATA_MAGN_XOUT_L;
__le16 sample;
@@ -438,10 +436,10 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
&sample, sizeof(sample));
if (ret)
goto done;
- buf[j++] = sample;
+ data->buf[j++] = sample;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buf, pf->timestamp);
+ iio_push_to_buffers_with_timestamp(indio_dev, data->buf, pf->timestamp);
done:
iio_trigger_notify_done(indio_dev->trig);
return IRQ_HANDLED;
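The bmi160 commit message above notes that the timestamp position depends on which channels are enabled. Assuming the timestamp is placed at the first 8-byte boundary after the enabled 16-bit samples (my understanding of how the IIO core lays out a scan, not something the patch itself spells out), the arithmetic can be sketched as follows. The toy_ helpers are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to the next multiple of a, where a is a power of two. */
#define TOY_ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Byte offset of the timestamp for a given number of enabled 16-bit channels. */
static size_t toy_ts_offset(size_t enabled_channels)
{
	return TOY_ALIGN_UP(enabled_channels * sizeof(uint16_t), sizeof(int64_t));
}

/* Total bytes pushed per scan: samples, padding, then the 8-byte timestamp. */
static size_t toy_scan_bytes(size_t enabled_channels)
{
	return toy_ts_offset(enabled_channels) + sizeof(int64_t);
}
```

With all six channels enabled this gives 12 bytes of samples, 4 bytes of padding and an 8-byte timestamp, 24 bytes in total, which matches the __le16 buf[12] in the patch. With three channels the timestamp moves to offset 8, which is why a fixed struct layout would be misleading here.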
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7b6b51234df6cd8b04fe736b0b89c25612d896b8 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:39 +0100
Subject: [PATCH] iio:imu:bmi160: Fix alignment and data leak issues
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable array in the iio_priv() data with alignment
explicitly requested. This data is allocated with kzalloc() so no
data can leak apart from previous readings.
In this driver, depending on which channels are enabled, the timestamp
can be in a number of locations. Hence we cannot use a structure
to specify the data layout without it being misleading.
Fixes: 77c4ad2d6a9b ("iio: imu: Add initial support for Bosch BMI160")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: Daniel Baluta <daniel.baluta(a)gmail.com>
Cc: Daniel Baluta <daniel.baluta(a)oss.nxp.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-6-jic23@kernel.org
diff --git a/drivers/iio/imu/bmi160/bmi160.h b/drivers/iio/imu/bmi160/bmi160.h
index a82e040bd109..32c2ea2d7112 100644
--- a/drivers/iio/imu/bmi160/bmi160.h
+++ b/drivers/iio/imu/bmi160/bmi160.h
@@ -10,6 +10,13 @@ struct bmi160_data {
struct iio_trigger *trig;
struct regulator_bulk_data supplies[2];
struct iio_mount_matrix orientation;
+ /*
+ * Ensure natural alignment for timestamp if present.
+ * Max length needed: 2 * 3 channels + 4 bytes padding + 8 byte ts.
+ * If fewer channels are enabled, less space may be needed, as
+ * long as the timestamp is still aligned to 8 bytes.
+ */
+ __le16 buf[12] __aligned(8);
};
extern const struct regmap_config bmi160_regmap_config;
diff --git a/drivers/iio/imu/bmi160/bmi160_core.c b/drivers/iio/imu/bmi160/bmi160_core.c
index c8e131c29043..290b5ef83f77 100644
--- a/drivers/iio/imu/bmi160/bmi160_core.c
+++ b/drivers/iio/imu/bmi160/bmi160_core.c
@@ -427,8 +427,6 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct bmi160_data *data = iio_priv(indio_dev);
- __le16 buf[12];
- /* 2 sens x 3 axis x __le16 + 2 x __le16 pad + 4 x __le16 tstamp */
int i, ret, j = 0, base = BMI160_REG_DATA_MAGN_XOUT_L;
__le16 sample;
@@ -438,10 +436,10 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
&sample, sizeof(sample));
if (ret)
goto done;
- buf[j++] = sample;
+ data->buf[j++] = sample;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buf, pf->timestamp);
+ iio_push_to_buffers_with_timestamp(indio_dev, data->buf, pf->timestamp);
done:
iio_trigger_notify_done(indio_dev->trig);
return IRQ_HANDLED;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7b6b51234df6cd8b04fe736b0b89c25612d896b8 Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Date: Sun, 20 Sep 2020 12:27:39 +0100
Subject: [PATCH] iio:imu:bmi160: Fix alignment and data leak issues
One of a class of bugs pointed out by Lars in a recent review.
iio_push_to_buffers_with_timestamp assumes the buffer used is aligned
to the size of the timestamp (8 bytes). This is not guaranteed in
this driver which uses an array of smaller elements on the stack.
As Lars also noted this anti pattern can involve a leak of data to
userspace and that indeed can happen here. We close both issues by
moving to a suitable array in the iio_priv() data with alignment
explicitly requested. This data is allocated with kzalloc() so no
data can leak apart from previous readings.
In this driver, depending on which channels are enabled, the timestamp
can be in a number of locations. Hence we cannot use a structure
to specify the data layout without it being misleading.
Fixes: 77c4ad2d6a9b ("iio: imu: Add initial support for Bosch BMI160")
Reported-by: Lars-Peter Clausen <lars(a)metafoo.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron(a)huawei.com>
Reviewed-by: Alexandru Ardelean <alexandru.ardelean(a)analog.com>
Cc: Daniel Baluta <daniel.baluta(a)gmail.com>
Cc: Daniel Baluta <daniel.baluta(a)oss.nxp.com>
Cc: <Stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200920112742.170751-6-jic23@kernel.org
diff --git a/drivers/iio/imu/bmi160/bmi160.h b/drivers/iio/imu/bmi160/bmi160.h
index a82e040bd109..32c2ea2d7112 100644
--- a/drivers/iio/imu/bmi160/bmi160.h
+++ b/drivers/iio/imu/bmi160/bmi160.h
@@ -10,6 +10,13 @@ struct bmi160_data {
struct iio_trigger *trig;
struct regulator_bulk_data supplies[2];
struct iio_mount_matrix orientation;
+ /*
+ * Ensure natural alignment for timestamp if present.
+ * Max length needed: 2 * 3 channels + 4 bytes padding + 8 byte ts.
+ * If fewer channels are enabled, less space may be needed, as
+ * long as the timestamp is still aligned to 8 bytes.
+ */
+ __le16 buf[12] __aligned(8);
};
extern const struct regmap_config bmi160_regmap_config;
diff --git a/drivers/iio/imu/bmi160/bmi160_core.c b/drivers/iio/imu/bmi160/bmi160_core.c
index c8e131c29043..290b5ef83f77 100644
--- a/drivers/iio/imu/bmi160/bmi160_core.c
+++ b/drivers/iio/imu/bmi160/bmi160_core.c
@@ -427,8 +427,6 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
struct iio_poll_func *pf = p;
struct iio_dev *indio_dev = pf->indio_dev;
struct bmi160_data *data = iio_priv(indio_dev);
- __le16 buf[12];
- /* 2 sens x 3 axis x __le16 + 2 x __le16 pad + 4 x __le16 tstamp */
int i, ret, j = 0, base = BMI160_REG_DATA_MAGN_XOUT_L;
__le16 sample;
@@ -438,10 +436,10 @@ static irqreturn_t bmi160_trigger_handler(int irq, void *p)
&sample, sizeof(sample));
if (ret)
goto done;
- buf[j++] = sample;
+ data->buf[j++] = sample;
}
- iio_push_to_buffers_with_timestamp(indio_dev, buf, pf->timestamp);
+ iio_push_to_buffers_with_timestamp(indio_dev, data->buf, pf->timestamp);
done:
iio_trigger_notify_done(indio_dev->trig);
return IRQ_HANDLED;
This reverts commit a135a1b4c4db1f3b8cbed9676a40ede39feb3362.
This leads to blank screens on some boards after replugging a
display. Revert until we understand the root cause and can
fix both the leak and the blank screen after replug.
Cc: Stylon Wang <stylon.wang(a)amd.com>
Cc: Harry Wentland <harry.wentland(a)amd.com>
Cc: Nicholas Kazlauskas <nicholas.kazlauskas(a)amd.com>
Cc: Andre Tomt <andre(a)tomt.net>
Cc: Oleksandr Natalenko <oleksandr(a)natalenko.name>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: stable(a)vger.kernel.org
---
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 0d2e334be87a..318eb12f8de7 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2385,8 +2385,7 @@ void amdgpu_dm_update_connector_after_detect(
drm_connector_update_edid_property(connector,
aconnector->edid);
- aconnector->num_modes = drm_add_edid_modes(connector, aconnector->edid);
- drm_connector_list_update(connector);
+ drm_add_edid_modes(connector, aconnector->edid);
if (aconnector->dc_link->aux_mode)
drm_dp_cec_set_edid(&aconnector->dm_dp_aux.aux,
--
2.29.2
We received a slab-out-of-bounds report in ip6_xmit() for a KASAN build on the
4.9 kernel. The patches that fix this issue have been backported to stable 4.14,
and one of them even to 3.16, but the 4.9 stable branch does not include them.
The backport to linux-4.9.y required trivial merge conflict resolution. The
patches apply cleanly to the linux-stable linux-4.9.y branch tagged v4.9.249.
Paolo Abeni (2):
net: ipv6: keep sk status consistent after datagram connect failure
l2tp: fix races with ipv4-mapped ipv6 addresses
net/ipv6/datagram.c | 21 +++++++++++++++++----
net/l2tp/l2tp_core.c | 38 ++++++++++++++++++--------------------
net/l2tp/l2tp_core.h | 3 ---
3 files changed, 35 insertions(+), 27 deletions(-)
--
2.29.2.729.g45daf8777d-goog
Commit 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
introduced a new location where a pmd was released, but neglected to run
the pmd page destructor. In fact, this happened previously for a
different pmd release path and was fixed by commit:
c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables").
This issue was hidden until recently because the failure mode is silent,
but commit:
b2b29d6d0119 ("mm: account PMD tables like PTE tables")
...turns the failure mode into this signature:
BUG: Bad page state in process lt-pmem-ns pfn:15943d
page:000000007262ed7b refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x15943d
flags: 0xaffff800000000()
raw: 00affff800000000 dead000000000100 0000000000000000 0000000000000000
raw: 0000000000000000 ffff913a029bcc08 00000000fffffbff 0000000000000000
page dumped because: nonzero mapcount
[..]
dump_stack+0x8b/0xb0
bad_page.cold+0x63/0x94
free_pcp_prepare+0x224/0x270
free_unref_page+0x18/0xd0
pud_free_pmd_page+0x146/0x160
ioremap_pud_range+0xe3/0x350
ioremap_page_range+0x108/0x160
__ioremap_caller.constprop.0+0x174/0x2b0
? memremap+0x7a/0x110
memremap+0x7a/0x110
devm_memremap+0x53/0xa0
pmem_attach_disk+0x4ed/0x530 [nd_pmem]
? __devm_release_region+0x52/0x80
nvdimm_bus_probe+0x85/0x210 [libnvdimm]
Given this is a repeat occurrence it seemed prudent to look for other
places where this destructor might be missing and whether a better
helper is needed. try_to_free_pmd_page() looks like a candidate, but
testing with setting up and tearing down pmd mappings via the dax unit
tests is thus far not triggering the failure. As for a better helper
pmd_free() is close, but it is a messy fit due to requiring an @mm arg.
Also, ___pmd_free_tlb() wants to call paravirt_tlb_remove_table()
instead of free_page(), so open-coded pgtable_pmd_page_dtor() seems the
best way forward for now.
Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces")
Cc: <stable(a)vger.kernel.org>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Ingo Molnar <mingo(a)redhat.com>
Cc: Borislav Petkov <bp(a)alien8.de>
Cc: x86(a)kernel.org
Cc: "H. Peter Anvin" <hpa(a)zytor.com>
Co-debugged-by: Matthew Wilcox <willy(a)infradead.org>
Tested-by: Yi Zhang <yi.zhang(a)redhat.com>
Signed-off-by: Dan Williams <dan.j.williams(a)intel.com>
---
arch/x86/mm/pgtable.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..f6a9e2e36642 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -829,6 +829,8 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
}
free_page((unsigned long)pmd_sv);
+
+ pgtable_pmd_page_dtor(virt_to_page(pmd));
free_page((unsigned long)pmd);
return 1;
A use-after-free is observed on the dmabuf's file->f_inode, caused by a
race between closing the dmabuf file and reading the dmabuf's debug
info.
Consider the below scenario where P1 is closing the dma_buf file
and P2 is reading the dma_buf's debug info in the system:
P1 P2
dma_buf_debug_show()
dma_buf_put()
__fput()
file->f_op->release()
dput()
....
dentry_unlink_inode()
iput(dentry->d_inode)
(where the inode is freed)
mutex_lock(&db_list.lock)
read 'dma_buf->file->f_inode'
(the same inode is freed by P1)
mutex_unlock(&db_list.lock)
dentry->d_op->d_release()-->
dma_buf_release()
.....
mutex_lock(&db_list.lock)
removes the dmabuf from the list
mutex_unlock(&db_list.lock)
In the above scenario, when dma_buf_put() is called on a dma_buf, it
first frees the dma_buf's file->f_inode (= dentry->d_inode) and only then
removes this dma_buf from the system db_list. In between, P2, while
traversing the db_list, tries to access this dma_buf's file->f_inode,
which was already freed by P1. This is a use-after-free.
Since __fput() calls f_op->release first and only later calls
d_op->d_release, move the dma_buf's db_list removal from d_release() to
f_op->release(). This ensures that dma_buf's file->f_inode is not
accessed after it is released.
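The ordering argument can be checked with a small user-space model. This is only a sketch of the reasoning, not kernel code: the enum, struct model_buf and teardown_is_safe() are hypothetical names, and the three steps are taken in the order the scenario above attributes to __fput().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Teardown steps, in the order described in the scenario above. */
enum step {
	STEP_FOPS_RELEASE,	/* file->f_op->release() */
	STEP_INODE_FREE,	/* iput(dentry->d_inode) */
	STEP_DENTRY_RELEASE,	/* dentry->d_op->d_release() */
};

/* Hypothetical stand-in for the state a debug reader depends on. */
struct model_buf {
	bool on_list;		/* still reachable via db_list */
	bool inode_valid;	/* file->f_inode not yet freed */
};

/*
 * Walk the teardown with the list removal done at step remove_at.
 * Returns true if, after every step, a buffer that is still on the
 * list also still has a valid inode.
 */
static bool teardown_is_safe(enum step remove_at)
{
	static const enum step order[] = {
		STEP_FOPS_RELEASE, STEP_INODE_FREE, STEP_DENTRY_RELEASE,
	};
	struct model_buf b = { .on_list = true, .inode_valid = true };
	size_t i;

	/* The list removal only ever lives in one of the two callbacks. */
	assert(remove_at == STEP_FOPS_RELEASE ||
	       remove_at == STEP_DENTRY_RELEASE);

	for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
		if (order[i] == STEP_INODE_FREE)
			b.inode_valid = false;
		if (order[i] == remove_at)
			b.on_list = false;
		/* A reader taking the list lock now would touch the inode. */
		if (b.on_list && !b.inode_valid)
			return false;
	}
	return true;
}
```

In this model teardown_is_safe(STEP_DENTRY_RELEASE) is false, because there is a window where the buffer is listed but its inode is gone, while teardown_is_safe(STEP_FOPS_RELEASE) is true.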
Cc: <stable(a)vger.kernel.org> # 5.4+
Fixes: 4ab59c3c638c ("dma-buf: Move dma_buf_release() from fops to dentry_ops")
Acked-by: Christian König <christian.koenig(a)amd.com>
Signed-off-by: Charan Teja Reddy <charante(a)codeaurora.org>
---
V2: Resending with stable tags and Acks
V1: https://lore.kernel.org/patchwork/patch/1360118/
drivers/dma-buf/dma-buf.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0eb80c1..a14dcbb 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -76,10 +76,6 @@ static void dma_buf_release(struct dentry *dentry)
dmabuf->ops->release(dmabuf);
- mutex_lock(&db_list.lock);
- list_del(&dmabuf->list_node);
- mutex_unlock(&db_list.lock);
-
if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
dma_resv_fini(dmabuf->resv);
@@ -88,6 +84,22 @@ static void dma_buf_release(struct dentry *dentry)
kfree(dmabuf);
}
+static int dma_buf_file_release(struct inode *inode, struct file *file)
+{
+ struct dma_buf *dmabuf;
+
+ if (!is_dma_buf_file(file))
+ return -EINVAL;
+
+ dmabuf = file->private_data;
+
+ mutex_lock(&db_list.lock);
+ list_del(&dmabuf->list_node);
+ mutex_unlock(&db_list.lock);
+
+ return 0;
+}
+
static const struct dentry_operations dma_buf_dentry_ops = {
.d_dname = dmabuffs_dname,
.d_release = dma_buf_release,
@@ -413,6 +425,7 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
}
static const struct file_operations dma_buf_fops = {
+ .release = dma_buf_file_release,
.mmap = dma_buf_mmap_internal,
.llseek = dma_buf_llseek,
.poll = dma_buf_poll,
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation
If an active transfer is dequeued, then the endpoint is freed to start a
new transfer. Make sure to clear the endpoint's transfer wait flag for
this case.
Cc: stable(a)vger.kernel.org
Fixes: e0d19563eb6c ("usb: dwc3: gadget: Wait for transfer completion")
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
---
Changes in v2:
- Only clear the wait flag if the selected request is of an active transfer.
Otherwise, any dequeue will change the endpoint's state even if it's for
some random request.
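
The v2 behaviour can be pictured with a minimal user-space model. Everything here is hypothetical: the flag bit value, struct model_ep and model_dequeue() only stand in for the driver's endpoint state and are not the dwc3 definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit; stands in for DWC3_EP_WAIT_TRANSFER_COMPLETE. */
#define EP_WAIT_TRANSFER_COMPLETE (1u << 0)

/* Hypothetical stand-in for the endpoint state. */
struct model_ep {
	unsigned int flags;
};

/*
 * Model of a dequeue: the wait flag is cleared only when the dequeued
 * request belongs to the active transfer. Other flags are left alone.
 */
static void model_dequeue(struct model_ep *ep, bool req_is_active)
{
	assert(ep != NULL);

	if (req_is_active)
		ep->flags &= ~EP_WAIT_TRANSFER_COMPLETE;
}
```

Dequeuing a request that is not part of the active transfer leaves the endpoint flags untouched, which is the difference from v1.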
drivers/usb/dwc3/gadget.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 78cb4db8a6e4..9a00dcaca010 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -1763,6 +1763,8 @@ static int dwc3_gadget_ep_dequeue(struct usb_ep *ep,
list_for_each_entry_safe(r, t, &dep->started_list, list)
dwc3_gadget_move_cancelled_request(r);
+ dep->flags &= ~DWC3_EP_WAIT_TRANSFER_COMPLETE;
+
goto out;
}
}
base-commit: 2edc7af892d0913bf06f5b35e49ec463f03d5ed8
--
2.28.0
Here's another variant of the PNY Pro Elite USB 3.1 Gen 2 portable SSD that
hangs and doesn't respond to ATA_1x pass-through commands. If it doesn't
support these commands, it should respond properly to the host. Add it
to the unusual uas list to be able to move forward with other
operations.
Cc: stable(a)vger.kernel.org
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
---
drivers/usb/storage/unusual_uas.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/drivers/usb/storage/unusual_uas.h b/drivers/usb/storage/unusual_uas.h
index 870e9cf3d5dc..f9677a5ec31b 100644
--- a/drivers/usb/storage/unusual_uas.h
+++ b/drivers/usb/storage/unusual_uas.h
@@ -90,6 +90,13 @@ UNUSUAL_DEV(0x152d, 0x0578, 0x0000, 0x9999,
USB_SC_DEVICE, USB_PR_DEVICE, NULL,
US_FL_BROKEN_FUA),
+/* Reported-by: Thinh Nguyen <thinhn(a)synopsys.com> */
+UNUSUAL_DEV(0x154b, 0xf00b, 0x0000, 0x9999,
+ "PNY",
+ "Pro Elite SSD",
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_NO_ATA_1X),
+
/* Reported-by: Thinh Nguyen <thinhn(a)synopsys.com> */
UNUSUAL_DEV(0x154b, 0xf00d, 0x0000, 0x9999,
"PNY",
base-commit: 5c8fe583cce542aa0b84adc939ce85293de36e5e
--
2.28.0
With CONFIG_EXPERT=y, CONFIG_KASAN=y, CONFIG_RANDOMIZE_BASE=n,
CONFIG_RELOCATABLE=n, we observe the following failure when trying to
link the kernel image with LD=ld.lld:
error: section: .exit.data is not contiguous with other relro sections
ld.lld defaults to -z relro while ld.bfd defaults to -z norelro. This
was previously fixed, but only for CONFIG_RELOCATABLE=y.
Cc: stable(a)vger.kernel.org
Fixes: commit 3bbd3db86470 ("arm64: relocatable: fix inconsistencies in linker script and options")
Signed-off-by: Nick Desaulniers <ndesaulniers(a)google.com>
---
While upgrading our toolchains for Android, we started seeing the above
failure for a particular config that enabled KASAN but disabled KASLR.
This was on a 5.4 stable branch. It looks like
commit dd4bc6076587 ("arm64: warn on incorrect placement of the kernel by the bootloader")
made RELOCATABLE=y the default and depend on EXPERT=y. With those two
enabled, we can then reproduce the same failure on mainline.
arch/arm64/Makefile | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index f4717facf31e..674241df91ab 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -10,13 +10,13 @@
#
# Copyright (C) 1995-2001 by Russell King
-LDFLAGS_vmlinux :=--no-undefined -X
+LDFLAGS_vmlinux :=--no-undefined -X -z norelro
ifeq ($(CONFIG_RELOCATABLE), y)
# Pass --no-apply-dynamic-relocs to restore pre-binutils-2.27 behaviour
# for relative relocs, since this leads to better Image compression
# with the relocation offsets always being zero.
-LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro \
+LDFLAGS_vmlinux += -shared -Bsymbolic -z notext \
$(call ld-option, --no-apply-dynamic-relocs)
endif
--
2.29.0.rc1.297.gfa9743e501-goog
This reverts stable commit baad618d078c857f99cc286ea249e9629159901f.
This commit is adding lines to spinand_write_to_cache_op, whereas the upstream
commit 868cbe2a6dcee451bd8f87cbbb2a73cf463b57e5 that this was supposed to
backport was touching spinand_read_from_cache_op.
It causes a crash on writing OOB data by attempting to write to read-only
kernel memory.
Cc: Miquel Raynal <miquel.raynal(a)bootlin.com>
Signed-off-by: Felix Fietkau <nbd(a)nbd.name>
---
drivers/mtd/nand/spi/core.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/mtd/nand/spi/core.c b/drivers/mtd/nand/spi/core.c
index 7900571fc85b..c35221794645 100644
--- a/drivers/mtd/nand/spi/core.c
+++ b/drivers/mtd/nand/spi/core.c
@@ -318,10 +318,6 @@ static int spinand_write_to_cache_op(struct spinand_device *spinand,
buf += ret;
}
- if (req->ooblen)
- memcpy(req->oobbuf.in, spinand->oobbuf + req->ooboffs,
- req->ooblen);
-
return 0;
}
--
2.28.0
Correctly handle the MVPG instruction when issued by a VSIE guest.
Fixes: a3508fbe9dc6d ("KVM: s390: vsie: initial support for nested virtualization")
Cc: stable(a)vger.kernel.org
Signed-off-by: Claudio Imbrenda <imbrenda(a)linux.ibm.com>
---
arch/s390/kvm/vsie.c | 73 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c
index ada49583e530..6c3069868acd 100644
--- a/arch/s390/kvm/vsie.c
+++ b/arch/s390/kvm/vsie.c
@@ -977,6 +977,75 @@ static int handle_stfle(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
return 0;
}
+static u64 vsie_get_register(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page, u8 reg)
+{
+ reg &= 0xf;
+ switch (reg) {
+ case 15:
+ return vsie_page->scb_s.gg15;
+ case 14:
+ return vsie_page->scb_s.gg14;
+ default:
+ return vcpu->run->s.regs.gprs[reg];
+ }
+}
+
+static int vsie_handle_mvpg(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
+{
+ struct kvm_s390_sie_block *scb_s = &vsie_page->scb_s;
+ unsigned long r1, r2, mask = PAGE_MASK;
+ int rc;
+
+ if (psw_bits(scb_s->gpsw).eaba == PSW_BITS_AMODE_24BIT)
+ mask = 0xfff000;
+ else if (psw_bits(scb_s->gpsw).eaba == PSW_BITS_AMODE_31BIT)
+ mask = 0x7ffff000;
+
+ r1 = vsie_get_register(vcpu, vsie_page, scb_s->ipb >> 20) & mask;
+ r2 = vsie_get_register(vcpu, vsie_page, scb_s->ipb >> 16) & mask;
+ rc = kvm_s390_vsie_mvpg_check(vcpu, r1, r2, &vsie_page->scb_o->mcic);
+
+ /*
+ * Guest translation was not successful. The host needs to forward
+ * the intercept to the guest and let the guest fix its page tables.
+ * The guest needs then to retry the instruction.
+ */
+ if (rc == -ENOENT)
+ return 1;
+
+ retry_vsie_icpt(vsie_page);
+
+ /*
+ * Guest translation was not successful. The page tables of the guest
+ * are broken. Try again and let the hardware deliver the fault.
+ */
+ if (rc == -EFAULT)
+ return 0;
+
+ /*
+ * Guest translation was successful. The host needs to fix up its
+ * page tables and retry the instruction in the nested guest.
+ * In case of failure, the instruction will intercept again, and
+ * a different path will be taken.
+ */
+ if (!rc) {
+ kvm_s390_shadow_fault(vcpu, vsie_page->gmap, r2);
+ kvm_s390_shadow_fault(vcpu, vsie_page->gmap, r1);
+ return 0;
+ }
+
+ /*
+ * An exception happened during guest translation, it needs to be
+ * delivered to the guest. This can happen if the host has EDAT1
+ * enabled and the guest has not, or for other causes. The guest
+ * needs to process the exception and return to the nested guest.
+ */
+ if (rc > 0)
+ return kvm_s390_inject_prog_cond(vcpu, rc);
+
+ return 1;
+}
+
/*
* Run the vsie on a shadow scb and a shadow gmap, without any further
* sanity checks, handling SIE faults.
@@ -1063,6 +1132,10 @@ static int do_vsie_run(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
if ((scb_s->ipa & 0xf000) != 0xf000)
scb_s->ipa += 0x1000;
break;
+ case ICPT_PARTEXEC:
+ if (scb_s->ipa == 0xb254)
+ rc = vsie_handle_mvpg(vcpu, vsie_page);
+ break;
}
return rc;
}
--
2.26.2
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
flushes) may result in random memory corruption. Any concurrent page-table walk
could end up with a Use-after-Free. Even on UP this might give issues, since
mmu_gather is preemptible these days. An interrupt or preempted task accessing
user pages might stumble into the free page if the hardware caches page
directories.
The series is a backport of the fix sent by Peter [1].
The first three patches are dependencies for the last patch (avoid potential
double flush). If the performance impact due to double flush is considered
trivial then the first three patches and last patch may be dropped.
This is only for v4.19 stable.
[1] https://patchwork.kernel.org/cover/11284843/
--
Changelog:
v2: Send the patches with the correct format (commit sha1 upstream) for stable
v3: Fix compilation issue on ppc40x_defconfig and ppc44x_defconfig
--
Aneesh Kumar K.V (1):
powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case
Peter Zijlstra (4):
asm-generic/tlb: Track freeing of page-table directories in struct
mmu_gather
asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
mm/mmu_gather: invalidate TLB correctly on batch allocation failure
and flush
asm-generic/tlb: avoid potential double flush
Will Deacon (1):
asm-generic/tlb: Track which levels of the page tables have been
cleared
arch/Kconfig | 3 -
arch/powerpc/Kconfig | 2 +-
arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 --
arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 -
arch/powerpc/include/asm/nohash/32/pgalloc.h | 8 --
arch/powerpc/include/asm/nohash/64/pgalloc.h | 9 +-
arch/powerpc/include/asm/tlb.h | 11 ++
arch/powerpc/mm/pgtable-book3s64.c | 7 --
arch/sparc/include/asm/tlb_64.h | 9 ++
arch/x86/Kconfig | 1 -
include/asm-generic/tlb.h | 103 ++++++++++++++++---
mm/memory.c | 20 ++--
12 files changed, 123 insertions(+), 60 deletions(-)
--
2.24.1
ksys_umount was refactored and split into another function (path_umount)
to enable code sharing. This changed the order in which flags and
permissions are validated, so that user_path_at is now called before the
flags are validated.
Unfortunately, libfuse2[1] and libmount[2] rely on the old flag validation
behaviour to determine whether or not the kernel supports UMOUNT_NOFOLLOW.
The other path that this validation is being checked on is
init_umount->path_umount->can_umount. That's all internal to the kernel. We
can safely move flag checking to ksys_umount, and let other users of
path_umount know they need to perform their own validation.
[1]: https://github.com/libfuse/libfuse/blob/9bfbeb576c5901b62a171d35510f0d1a922…
[2]: https://github.com/karelzak/util-linux/blob/7ed579523b556b1270f28dbdb7ee07d…
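
For illustration, a user-space probe that depends on this ordering could look roughly like the sketch below. It is not copied from libfuse2 or libmount: the function name umount_nofollow_supported() and the PROBE_UNUSED_FLAG constant are hypothetical, and the sketch assumes the caller has enough privilege for the call to get past permission checks.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <sys/mount.h>

#ifndef UMOUNT_NOFOLLOW
#define UMOUNT_NOFOLLOW 8
#endif

/* Hypothetical: a flag bit that no kernel is expected to accept. */
#define PROBE_UNUSED_FLAG INT_MIN

static_assert((PROBE_UNUSED_FLAG & UMOUNT_NOFOLLOW) == 0,
	      "probe flag must not overlap a real umount flag");

/*
 * Returns 1 if the kernel appears to validate umount flags before
 * looking up the path and accepts UMOUNT_NOFOLLOW, 0 otherwise.
 */
static int umount_nofollow_supported(void)
{
	/*
	 * An unknown flag should be rejected with EINVAL even though the
	 * empty path can never be resolved.
	 */
	if (umount2("", PROBE_UNUSED_FLAG) != -1 || errno != EINVAL)
		return 0;

	/*
	 * With a known flag the call should reach path lookup and fail
	 * with ENOENT for the empty path.
	 */
	if (umount2("", UMOUNT_NOFOLLOW) != -1 || errno != ENOENT)
		return 0;

	return 1;
}
```

If the path lookup runs before flag validation, the first call fails with ENOENT instead of EINVAL and a probe of this shape concludes that UMOUNT_NOFOLLOW is unsupported.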
Signed-off-by: Sargun Dhillon <sargun(a)sargun.me>
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: stable(a)vger.kernel.org
Cc: Alexander Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-fsdevel(a)vger.kernel.org
Fixes: 41525f56e256 ("fs: refactor ksys_umount")
---
fs/namespace.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index cebaa3e81794..752f82121dd4 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1710,8 +1710,6 @@ static int can_umount(const struct path *path, int flags)
{
struct mount *mnt = real_mount(path->mnt);
- if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
- return -EINVAL;
if (!may_mount())
return -EPERM;
if (path->dentry != path->mnt->mnt_root)
@@ -1725,6 +1723,13 @@ static int can_umount(const struct path *path, int flags)
return 0;
}
+
+/*
+ * path_umount - unmount by path
+ *
+ * path_umount does not check the validity of flags. It is up to the caller
+ * to ensure that it only contains valid umount options.
+ */
int path_umount(struct path *path, int flags)
{
struct mount *mnt = real_mount(path->mnt);
@@ -1746,6 +1751,10 @@ static int ksys_umount(char __user *name, int flags)
struct path path;
int ret;
+ /* Check flag validity first to allow probing of supported flags */
+ if (flags & ~(MNT_FORCE | MNT_DETACH | MNT_EXPIRE | UMOUNT_NOFOLLOW))
+ return -EINVAL;
+
if (!(flags & UMOUNT_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
ret = user_path_at(AT_FDCWD, name, lookup_flags, &path);
--
2.25.1
If a transfer is dequeued, then the endpoint is freed to start a new
transfer. Make sure to clear the endpoint's transfer wait flag if
dequeued.
Cc: stable(a)vger.kernel.org
Fixes: e0d19563eb6c ("usb: dwc3: gadget: Wait for transfer completion")
Signed-off-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com>
---
drivers/usb/dwc3/gadget.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 78cb4db8a6e4..9eaeb3c5d06c 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -1771,6 +1771,7 @@ static int dwc3_gadget_ep_dequeue(struct usb_ep *ep,
request, ep->name);
ret = -EINVAL;
out:
+ dep->flags &= ~DWC3_EP_WAIT_TRANSFER_COMPLETE;
spin_unlock_irqrestore(&dwc->lock, flags);
return ret;
base-commit: 5c8fe583cce542aa0b84adc939ce85293de36e5e
--
2.28.0
When an inode has no listxattr op of its own (e.g. squashfs), vfs_listxattr
calls the LSM inode_listsecurity hooks to list the xattrs that LSMs will
intercept in inode_getxattr hooks.
When the selinux LSM is installed but not initialized, it will list the
security.selinux xattr in inode_listsecurity, but will not intercept it
in inode_getxattr. This results in -ENODATA for a getxattr call for an
xattr returned by listxattr.
This situation manifested as an overlayfs failure to copy up lower
files from squashfs when selinux is built-in but not initialized,
because ovl_copy_xattr() iterates the lower inode xattrs by
vfs_listxattr() and vfs_getxattr().
Match the logic of inode_listsecurity to that of inode_getxattr and
do not list the security.selinux xattr if selinux is not initialized.
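The inconsistency can be observed from user space with the listxattr(2) and getxattr(2) system calls. The following is a minimal sketch, not part of the patch: count_unreadable_xattrs() is a hypothetical helper that counts names which are listed for a path but for which getxattr fails with ENODATA.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Count xattr names that listxattr() reports for path but that
 * getxattr() rejects with ENODATA. Returns -1 if listing fails.
 */
static int count_unreadable_xattrs(const char *path)
{
	ssize_t len, off;
	char *names;
	int bad = 0;

	assert(path != NULL);

	/* A NULL buffer asks only for the size of the name list. */
	len = listxattr(path, NULL, 0);
	if (len < 0)
		return -1;
	if (len == 0)
		return 0;

	names = malloc((size_t)len);
	if (!names)
		return -1;

	len = listxattr(path, names, (size_t)len);
	if (len < 0) {
		free(names);
		return -1;
	}

	/* The list is a sequence of NUL-terminated names. */
	for (off = 0; off < len; off += (ssize_t)strlen(names + off) + 1) {
		if (getxattr(path, names + off, NULL, 0) < 0 &&
		    errno == ENODATA)
			bad++;
	}

	free(names);
	return bad;
}
```

On a file from a lower squashfs layer in the situation described above, a check like this would be expected to report security.selinux as listed but unreadable. After the change the name is no longer listed, so the count for that xattr drops to zero.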
Reported-by: Michael Labriola <michael.d.labriola(a)gmail.com>
Tested-by: Michael Labriola <michael.d.labriola(a)gmail.com>
Link: https://lore.kernel.org/linux-unionfs/2nv9d47zt7.fsf@aldarion.sourceruckus.…
Fixes: c8e222616c7e ("selinux: allow reading labels before policy is loaded")
Cc: stable(a)vger.kernel.org#v5.9+
Signed-off-by: Amir Goldstein <amir73il(a)gmail.com>
---
security/selinux/hooks.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 6b1826fc3658..e132e082a5af 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3406,6 +3406,10 @@ static int selinux_inode_setsecurity(struct inode *inode, const char *name,
static int selinux_inode_listsecurity(struct inode *inode, char *buffer, size_t buffer_size)
{
const int len = sizeof(XATTR_NAME_SELINUX);
+
+ if (!selinux_initialized(&selinux_state))
+ return 0;
+
if (buffer && len <= buffer_size)
memcpy(buffer, XATTR_NAME_SELINUX, len);
return len;
--
2.25.1