syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free.
Reported-by: syzbot syzkaller@googlegroups.com Cc: Ingo Molnar mingo@kernel.org Cc: Al Viro viro@zeniv.linux.org.uk Cc: Andrew Morton akpm@linux-foundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Signed-off-by: Cong Wang xiyou.wangcong@gmail.com --- kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/exit.c b/kernel/exit.c index 6b4298a41167..909e43c45158 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -861,8 +861,8 @@ void __noreturn do_exit(long code) exit_fs(tsk); if (group_dead) disassociate_ctty(1); - exit_task_namespaces(tsk); exit_task_work(tsk); + exit_task_namespaces(tsk); exit_thread(tsk);
/*
On Thu, Dec 14, 2017 at 12:17:57PM -0800, Cong Wang wrote:
syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free.
What's to prevent somebody else holding a reference to the same inode past the exit(2)? IOW, I don't believe that this is fixing anything - in the best case, your patch papers over a specific reproducer.
On Thu, Dec 14, 2017 at 1:08 PM, Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Dec 14, 2017 at 12:17:57PM -0800, Cong Wang wrote:
syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free.
What's to prevent somebody else holding a reference to the same inode past the exit(2)? IOW, I don't believe that this is fixing anything - in the best case, your patch papers over a specific reproducer.
You are right, I missed mq_clear_sbinfo().
And the offending commit is:
commit 9c583773d036336176e9e50441890659bc4eeae8 Author: Giuseppe Scrivano gscrivan@redhat.com Date: Fri Dec 15 01:06:28 2017 +0000
ipc, mqueue: lazy call kern_mount_data in new namespaces
kern_mount_data is a relatively expensive operation when creating a new IPC namespace, so delay the mount until its first usage when not creating the the global namespace.
This is a net saving for new IPC namespaces that don't use mq_open(). In this case there won't be any kern_mount_data() cost at all.
On my machine, the time for creating 1000 new IPC namespaces dropped from ~8s to ~2s.
Link: http://lkml.kernel.org/r/20171206151422.9660-1-gscrivan@redhat.com Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com Cc: Manfred Spraul manfred@colorfullife.com Cc: Davidlohr Bueso dave@stgolabs.net Cc: Al Viro viro@zeniv.linux.org.uk Signed-off-by: Andrew Morton akpm@linux-foundation.org
Cong Wang xiyou.wangcong@gmail.com writes:
On Thu, Dec 14, 2017 at 1:08 PM, Al Viro viro@zeniv.linux.org.uk wrote:
On Thu, Dec 14, 2017 at 12:17:57PM -0800, Cong Wang wrote:
syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free.
What's to prevent somebody else holding a reference to the same inode past the exit(2)? IOW, I don't believe that this is fixing anything - in the best case, your patch papers over a specific reproducer.
You are right, I missed mq_clear_sbinfo().
yes, the patch 9c583773d036336176e9e50441890659bc4eeae8 introduced an issue since it doesn't reset s_fs_info when ns->mq_mnt == NULL.
The following patch solves the issue for me. I store the superblock in "struct ipc_namespace" since we cannot assume "mq_mnt" exists.
Does the patch look fine? I'll clean it up and submit it separately if this approach is correct.
Regards, Giuseppe
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h index 554e31494f69..66d4a5740a60 100644 --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -53,6 +53,7 @@ struct ipc_namespace {
/* The kern_mount of the mqueuefs sb. We take a ref on it */ struct vfsmount *mq_mnt; + struct super_block *mq_sb;
/* # queues in this ns, protected by mq_lock */ unsigned int mq_queues_count; diff --git a/ipc/mqueue.c b/ipc/mqueue.c index f78d6687a61b..16347cf1de2d 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -341,6 +341,7 @@ static int mqueue_fill_super(struct super_block *sb, void *data, int silent) sb->s_root = d_make_root(inode); if (!sb->s_root) return -ENOMEM; + ns->mq_sb = sb; return 0; }
@@ -1554,6 +1555,7 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount) ns->mq_msg_max = DFLT_MSGMAX; ns->mq_msgsize_max = DFLT_MSGSIZEMAX; ns->mq_msg_default = DFLT_MSG; + ns->mq_sb = NULL; ns->mq_msgsize_default = DFLT_MSGSIZE;
if (!mount) @@ -1573,8 +1575,8 @@ int mq_init_ns(struct ipc_namespace *ns, bool mount)
void mq_clear_sbinfo(struct ipc_namespace *ns) { - if (ns->mq_mnt) - ns->mq_mnt->mnt_sb->s_fs_info = NULL; + if (ns->mq_sb) + ns->mq_sb->s_fs_info = NULL; }
void mq_put_mnt(struct ipc_namespace *ns)
Cong Wang xiyou.wangcong@gmail.com writes:
syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free.
How does that possibly work. (I haven't seen this syzbot report).
Looking at the code we have get_ns_from_inode. Which takes the mq_lock, sees if the pointer is NULL and takes a reference if it is non-NULL.
Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held when the count drops to zero.
Where is the race in that?
The rest of mqueue_evict_inode uses the returned pointer and tests that the pointer is non-NULL before user it.
So either szbot is giving you a bad report or there is a subtle race there I am not seeing. The change below is not at all the proper way to fix a subtle race.
Eric
Reported-by: syzbot syzkaller@googlegroups.com Cc: Ingo Molnar mingo@kernel.org Cc: Al Viro viro@zeniv.linux.org.uk Cc: Andrew Morton akpm@linux-foundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Signed-off-by: Cong Wang xiyou.wangcong@gmail.com
kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/exit.c b/kernel/exit.c index 6b4298a41167..909e43c45158 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -861,8 +861,8 @@ void __noreturn do_exit(long code) exit_fs(tsk); if (group_dead) disassociate_ctty(1);
- exit_task_namespaces(tsk); exit_task_work(tsk);
- exit_task_namespaces(tsk); exit_thread(tsk);
/*
On Fri, Dec 15, 2017 at 7:56 AM, Eric W. Biederman ebiederm@xmission.com wrote:
Cong Wang xiyou.wangcong@gmail.com writes:
syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free.
How does that possibly work. (I haven't seen this syzbot report).
Looking at the code we have get_ns_from_inode. Which takes the mq_lock, sees if the pointer is NULL and takes a reference if it is non-NULL.
Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held when the count drops to zero.
Where is the race in that?
The rest of mqueue_evict_inode uses the returned pointer and tests that the pointer is non-NULL before user it.
So either szbot is giving you a bad report or there is a subtle race there I am not seeing. The change below is not at all the proper way to fix a subtle race.
Eric
Cong, what was that report? Searching by "exit_task_work|exit_task_namespaces" there are too many of them: https://groups.google.com/forum/#%21searchin/syzkaller-bugs/%22exit_task_wor...
I can only say that syzbot does not make up reports. That's something that actually happened and was provoked by userspace.
Reported-by: syzbot syzkaller@googlegroups.com Cc: Ingo Molnar mingo@kernel.org Cc: Al Viro viro@zeniv.linux.org.uk Cc: Andrew Morton akpm@linux-foundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Signed-off-by: Cong Wang xiyou.wangcong@gmail.com
kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/exit.c b/kernel/exit.c index 6b4298a41167..909e43c45158 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -861,8 +861,8 @@ void __noreturn do_exit(long code) exit_fs(tsk); if (group_dead) disassociate_ctty(1);
exit_task_namespaces(tsk); exit_task_work(tsk);
exit_task_namespaces(tsk); exit_thread(tsk); /*
On Fri, Dec 15, 2017 at 8:35 AM, Dmitry Vyukov dvyukov@google.com wrote:
On Fri, Dec 15, 2017 at 7:56 AM, Eric W. Biederman ebiederm@xmission.com wrote:
Cong Wang xiyou.wangcong@gmail.com writes:
syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free.
How does that possibly work. (I haven't seen this syzbot report).
Looking at the code we have get_ns_from_inode. Which takes the mq_lock, sees if the pointer is NULL and takes a reference if it is non-NULL.
Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held when the count drops to zero.
Where is the race in that?
The rest of mqueue_evict_inode uses the returned pointer and tests that the pointer is non-NULL before user it.
So either szbot is giving you a bad report or there is a subtle race there I am not seeing. The change below is not at all the proper way to fix a subtle race.
Eric
Cong, what was that report? Searching by "exit_task_work|exit_task_namespaces" there are too many of them: https://groups.google.com/forum/#%21searchin/syzkaller-bugs/%22exit_task_wor...
I can only say that syzbot does not make up reports. That's something that actually happened and was provoked by userspace.
Ah, found that bug: https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ
Reported-by: syzbot syzkaller@googlegroups.com Cc: Ingo Molnar mingo@kernel.org Cc: Al Viro viro@zeniv.linux.org.uk Cc: Andrew Morton akpm@linux-foundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Cc: stable@vger.kernel.org Signed-off-by: Cong Wang xiyou.wangcong@gmail.com
kernel/exit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/exit.c b/kernel/exit.c index 6b4298a41167..909e43c45158 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -861,8 +861,8 @@ void __noreturn do_exit(long code) exit_fs(tsk); if (group_dead) disassociate_ctty(1);
exit_task_namespaces(tsk); exit_task_work(tsk);
exit_task_namespaces(tsk); exit_thread(tsk); /*
On Fri, Dec 15, 2017 at 12:00 AM, Dmitry Vyukov dvyukov@google.com wrote:
On Fri, Dec 15, 2017 at 8:35 AM, Dmitry Vyukov dvyukov@google.com wrote:
On Fri, Dec 15, 2017 at 7:56 AM, Eric W. Biederman ebiederm@xmission.com wrote:
Cong Wang xiyou.wangcong@gmail.com writes:
syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free.
How does that possibly work. (I haven't seen this syzbot report).
Looking at the code we have get_ns_from_inode. Which takes the mq_lock, sees if the pointer is NULL and takes a reference if it is non-NULL.
Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held when the count drops to zero.
Where is the race in that?
The rest of mqueue_evict_inode uses the returned pointer and tests that the pointer is non-NULL before user it.
So either szbot is giving you a bad report or there is a subtle race there I am not seeing. The change below is not at all the proper way to fix a subtle race.
Eric
Cong, what was that report? Searching by "exit_task_work|exit_task_namespaces" there are too many of them: https://groups.google.com/forum/#%21searchin/syzkaller-bugs/%22exit_task_wor...
I can only say that syzbot does not make up reports. That's something that actually happened and was provoked by userspace.
Ah, found that bug: https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ
Yeah, and it is introduced by:
http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/commit/?id=9c583773d03633617...
On Sat, Dec 16, 2017 at 1:00 AM, Cong Wang xiyou.wangcong@gmail.com wrote:
syzbot reported we have a use-after-free when mqueue_evict_inode() is called on __cleanup_mnt() path, where the ipc ns is already freed by the previous exit_task_namespaces(). We can just move it after after exit_task_work() to avoid this use-after-free.
How does that possibly work. (I haven't seen this syzbot report).
Looking at the code we have get_ns_from_inode. Which takes the mq_lock, sees if the pointer is NULL and takes a reference if it is non-NULL.
Meanwhile put_ipc_ns calls mq_clear_sbinfo(ns) with the mq_lock held when the count drops to zero.
Where is the race in that?
The rest of mqueue_evict_inode uses the returned pointer and tests that the pointer is non-NULL before user it.
So either szbot is giving you a bad report or there is a subtle race there I am not seeing. The change below is not at all the proper way to fix a subtle race.
Eric
Cong, what was that report? Searching by "exit_task_work|exit_task_namespaces" there are too many of them: https://groups.google.com/forum/#%21searchin/syzkaller-bugs/%22exit_task_wor...
I can only say that syzbot does not make up reports. That's something that actually happened and was provoked by userspace.
Ah, found that bug: https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/VF-eCSPuCQAJ
Yeah, and it is introduced by:
http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/commit/?id=9c583773d03633617...
If it's going to be discussed here, note that it was reported by syzbot and requires "#syz fix" tag after fixing here: https://groups.google.com/d/msg/syzkaller-bugs/1XBaqnPSXzs/jGDCMoz_AQAJ
linux-stable-mirror@lists.linaro.org