From: Will Deacon will@kernel.org
Closing /dev/pts/ptmx removes the corresponding pty under /dev/pts/ without synchronizing against concurrent path walkers. This can lead to 'dcache_readdir()' tripping over a 'struct dentry' with a NULL 'd_inode' field:
| BUG: kernel NULL pointer dereference, address: 0000000000000000 | #PF: supervisor read access in kernel mode | #PF: error_code(0x0000) - not-present page | PGD 0 P4D 0 | SMP PTI | CPU: 9 PID: 179 Comm: ptmx Not tainted 5.4.0-rc1+ #5 | RIP: 0010:dcache_readdir+0xe1/0x150 | Code: 48 83 f8 01 74 a2 48 83 f8 02 74 eb 4d 8d a7 90 00 00 00 45 31 f6 eb 42 48 8b 43 30 48 8b 4d 08 48 89 ef 8b 53 24 48 8b 73 28 <44> 0f b7 08 4c 8b 55 00 4c 8b 40 40 66 41 c1 e9 0c 41 83 e1 0f e8 | RSP: 0018:ffffa7df8044be58 EFLAGS: 00010286 | RAX: 0000000000000000 RBX: ffff9511c78f3ec0 RCX: 0000000000000002 | RDX: 0000000000000001 RSI: ffff9511c78f3ef8 RDI: ffffa7df8044bed0 | RBP: ffffa7df8044bed0 R08: 0000000000000000 R09: 00000000004bc478 | R10: ffff9511c877c6a8 R11: ffff9511c8dde600 R12: ffff9511c878c460 | R13: ffff9511c878c3c0 R14: 0000000000000000 R15: ffff9511c9442cc0 | FS: 00007fc5ea2e1700(0000) GS:ffff9511ca280000(0000) knlGS:0000000000000000 | CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 | CR2: 0000000000000000 CR3: 0000000047d68002 CR4: 0000000000760ea0 | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 | PKRU: 55555554 | Call Trace: | iterate_dir+0x137/0x190 | ksys_getdents64+0x97/0x130 | ? iterate_dir+0x190/0x190 | __x64_sys_getdents64+0x11/0x20 | do_syscall_64+0x43/0x110 | entry_SYSCALL_64_after_hwframe+0x44/0xa9
In this case, one CPU is deleting the dentry and clearing the inode pointer via:
devpts_pty_kill() -> dput() -> dentry_kill() -> __dentry_kill() -> dentry_unlink_inode()
whilst the other is traversing the directory an obtaining a reference to the dentry being deleted via:
sys_getdents64() -> iterate_dir() -> dcache_readdir() -> next_positive()
Prevent the race by acquiring the inode lock of the parent in 'devpts_pty_kill()' so that path walkers are held off until the dentry has been completely torn down.
Will's fix didn't link to the commit it fixes so I tracked it down. devpts_pty_kill() used to take inode_lock() before removing a pts device. The inode_lock() got removed in 8ead9dd54716 ("devpts: more pty driver interface cleanups"). The reasoning behind the removal seemed to be that the inode_lock() was only needed because d_find_alias(inode) had to be called before that commit to find the dentry that was supposed to be removed. Linus then changed the pty driver to stash away the dentry and subsequently got rid of the inode_lock(). However, it seems that the inode_lock() is needed to protect against the race outlined above. So add it back.
Note that this bug had been brought up before in November 2018 before (cf. [1]). But a fix never got merged because a proper commit wasn't sent. The issue came back up when Will and I talked about it at Kernel Recipes in Paris. So here is a fix which prevents the issue. I very much vote we get this merged asap, since as an unprivileged user I can do:
unshare -U --map-root --mount mount -t devpts devpts ./<run_reproducer_below
/* Reproducer */ Note that the reproducer will take a little while. Will reported usually around 10s. For me it took a few minutes.
#ifndef _GNU_SOURCE #define _GNU_SOURCE 1 #endif #include <dirent.h> #include <errno.h> #include <fcntl.h> #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <sys/stat.h> #include <sys/types.h> #include <unistd.h>
static void *readdir_thread(void *arg) { DIR *d = arg; struct dirent *dent;
for (;;) { errno = 0; dent = readdir(d); if (!dent) { if (errno) perror("readdir"); break; } rewinddir(d); }
return NULL; }
int main(void) { DIR *d; pthread_t t; int ret = 0;
d = opendir("/dev/pts"); if (!d) { ret = errno; perror("opendir"); exit(EXIT_FAILURE); }
ret = pthread_create(&t, NULL, readdir_thread, d); if (ret) { errno = ret; perror("pthread_create"); exit(EXIT_FAILURE); }
for (;;) { int dfd; int fd;
dfd = dirfd(d); if (dfd < 0) { perror("dirfd"); break; }
fd = openat(dirfd(d), "ptmx", O_RDONLY | O_NONBLOCK | O_CLOEXEC); if (fd < 0) { perror("openat"); break; } close(fd); }
pthread_join(t, NULL); closedir(d); exit(EXIT_SUCCESS); }
/* References */ [1]: https://lore.kernel.org/r/20181109143744.GA12128@hc
Fixes: 8ead9dd54716 ("devpts: more pty driver interface cleanups") Cc: stable@vger.kernel.org Cc: Jan Glauber jglauber@marvell.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Will Deacon will@kernel.org Reviewed-by: Christian Brauner christian.brauner@ubuntu.com [christian.brauner@ubuntu.com: dig into history and add context and reproducer to commit message] Signed-off-by: Christian Brauner christian.brauner@ubuntu.com --- fs/devpts/inode.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c index 42e5a766d33c..4b4546347aac 100644 --- a/fs/devpts/inode.c +++ b/fs/devpts/inode.c @@ -617,13 +617,18 @@ void *devpts_get_priv(struct dentry *dentry) */ void devpts_pty_kill(struct dentry *dentry) { - WARN_ON_ONCE(dentry->d_sb->s_magic != DEVPTS_SUPER_MAGIC); + struct super_block *sb = dentry->d_sb; + struct dentry *parent = sb->s_root;
+ WARN_ON_ONCE(sb->s_magic != DEVPTS_SUPER_MAGIC); + + inode_lock(parent->d_inode); dentry->d_fsdata = NULL; drop_nlink(dentry->d_inode); fsnotify_unlink(d_inode(dentry->d_parent), dentry); d_drop(dentry); dput(dentry); /* d_alloc_name() in devpts_pty_new() */ + inode_unlock(parent->d_inode); }
static int __init init_devpts_fs(void)
On Fri, Oct 04, 2019 at 04:05:03PM +0200, Christian Brauner wrote:
From: Will Deacon will@kernel.org
Closing /dev/pts/ptmx removes the corresponding pty under /dev/pts/ without synchronizing against concurrent path walkers. This can lead to 'dcache_readdir()' tripping over a 'struct dentry' with a NULL 'd_inode' field:
FWIW, vfs.git#fixes (or #next.dcache) ought to deal with that one.
On Fri, Oct 04, 2019 at 03:27:48PM +0100, Al Viro wrote:
On Fri, Oct 04, 2019 at 04:05:03PM +0200, Christian Brauner wrote:
From: Will Deacon will@kernel.org
Closing /dev/pts/ptmx removes the corresponding pty under /dev/pts/ without synchronizing against concurrent path walkers. This can lead to 'dcache_readdir()' tripping over a 'struct dentry' with a NULL 'd_inode' field:
FWIW, vfs.git#fixes (or #next.dcache) ought to deal with that one.
Is it feasible to backport your changes? Or do we want to merge the one here first and backport?
Christian
On Fri, Oct 04, 2019 at 04:33:02PM +0200, Christian Brauner wrote:
On Fri, Oct 04, 2019 at 03:27:48PM +0100, Al Viro wrote:
On Fri, Oct 04, 2019 at 04:05:03PM +0200, Christian Brauner wrote:
From: Will Deacon will@kernel.org
Closing /dev/pts/ptmx removes the corresponding pty under /dev/pts/ without synchronizing against concurrent path walkers. This can lead to 'dcache_readdir()' tripping over a 'struct dentry' with a NULL 'd_inode' field:
FWIW, vfs.git#fixes (or #next.dcache) ought to deal with that one.
Is it feasible to backport your changes? Or do we want to merge the one here first and backport?
I'm not sure. The whole pile is backportable, all right (and the first commit alone should take care of devpts problem). However, there's a performance regression on some loads; it *is* possible to get the thing reasonably lockless without fucking it up (as the original conversion had been). Still not in the series, since cifs (ab)use of dcache_readdir() needs to be clarified to figure out the right way to do it. Asked CIFS folks, got no reaction whatsoever, will ask again...
Al, mostly back after flu, digging through the piles of mail
On Fri, Oct 04, 2019 at 04:10:58PM +0100, Al Viro wrote:
On Fri, Oct 04, 2019 at 04:33:02PM +0200, Christian Brauner wrote:
On Fri, Oct 04, 2019 at 03:27:48PM +0100, Al Viro wrote:
On Fri, Oct 04, 2019 at 04:05:03PM +0200, Christian Brauner wrote:
From: Will Deacon will@kernel.org
Closing /dev/pts/ptmx removes the corresponding pty under /dev/pts/ without synchronizing against concurrent path walkers. This can lead to 'dcache_readdir()' tripping over a 'struct dentry' with a NULL 'd_inode' field:
FWIW, vfs.git#fixes (or #next.dcache) ought to deal with that one.
Is it feasible to backport your changes? Or do we want to merge the one here first and backport?
I'm not sure. The whole pile is backportable, all right (and the first commit
Ok. So here's what I propose: we'll merge this one as it seems an obvious fix to the problem and can easily be backported to stable kernels. Then you'll land your generic workaround alleviating callers from holding inode_lock(). Then I'll send a patch to remove the inode_lock() from devpts for master. If we see that your fix is fine to backport and has no performance impacts that you find unacceptable we backport it.
alone should take care of devpts problem). However, there's a performance regression on some loads; it *is* possible to get the thing reasonably lockless without fucking it up (as the original conversion had been). Still not in the series, since cifs (ab)use of dcache_readdir() needs to be clarified to figure out the right way to do it. Asked CIFS folks, got no reaction whatsoever, will ask again...
Al, mostly back after flu, digging through the piles of mail
Ugh, flu season...
Christian
On Fri, Oct 04, 2019 at 05:25:28PM +0200, Christian Brauner wrote:
On Fri, Oct 04, 2019 at 04:10:58PM +0100, Al Viro wrote:
On Fri, Oct 04, 2019 at 04:33:02PM +0200, Christian Brauner wrote:
On Fri, Oct 04, 2019 at 03:27:48PM +0100, Al Viro wrote:
On Fri, Oct 04, 2019 at 04:05:03PM +0200, Christian Brauner wrote:
From: Will Deacon will@kernel.org
Closing /dev/pts/ptmx removes the corresponding pty under /dev/pts/ without synchronizing against concurrent path walkers. This can lead to 'dcache_readdir()' tripping over a 'struct dentry' with a NULL 'd_inode' field:
FWIW, vfs.git#fixes (or #next.dcache) ought to deal with that one.
Is it feasible to backport your changes? Or do we want to merge the one here first and backport?
I'm not sure. The whole pile is backportable, all right (and the first commit
Ok. So here's what I propose: we'll merge this one as it seems an obvious fix to the problem and can easily be backported to stable kernels. Then you'll land your generic workaround alleviating callers from holding inode_lock(). Then I'll send a patch to remove the inode_lock() from devpts for master. If we see that your fix is fine to backport and has no performance impacts that you find unacceptable we backport it.
There's more than one bug here. * fucked up lockless traversals. Affect anything that uses dcache_readdir() * devpts (and selinuxfs, while we are at it) running afoul of (implicit) assumption by dcache_readdir() - that stuff won't get removed from under it * (possibly) cifs hitting the same on eviction by memory pressure alone (no locked inodes anywhere in sight). Possibly == if cifs IPC$ share happens to show up non-empty (e.g. due to server playing silly buggers). * (possibly) cifs hitting *another* lovely issue - lookup in one subdirectory of IPC$ root finding an alias for another subdirectory of said root, triggering d_move() of dentry of the latter. IF the name happens to be long enough to be externally allocated and if dcache_readdir() on root is currently copying it to userland, Bad Things(tm) will happen. That one almost certainly depends upon the server playing silly buggers and might or might not be possible. I'm not familiar enough with CIFS to tell.
The first 3 are dealt with by the first commit in that pile; the last one is not. devpts patch of yours would deal with a part of the second bug. Performance regression comes with fixing the first one, which is also quite real. There might be a way to avoid that performance hit, but it will be harder to backport.
FWIW, some discussion of that fun went in a thread shortly before the merge window - look for "Possible FS race condition between iterate_dir and d_alloc_parallel" on fsdevel. Some of that went off-list, though...
On Fri, Oct 04, 2019 at 05:02:20PM +0100, Al Viro wrote:
- (possibly) cifs hitting the same on eviction by memory pressure alone
(no locked inodes anywhere in sight). Possibly == if cifs IPC$ share happens to show up non-empty (e.g. due to server playing silly buggers).
- (possibly) cifs hitting *another* lovely issue - lookup in one subdirectory
of IPC$ root finding an alias for another subdirectory of said root, triggering d_move() of dentry of the latter. IF the name happens to be long enough to be externally allocated and if dcache_readdir() on root is currently copying it to userland, Bad Things(tm) will happen. That one almost certainly depends upon the server playing silly buggers and might or might not be possible. I'm not familiar enough with CIFS to tell.
BTW, I would really appreciate somebody familiar with CIFS giving a braindump on that. Questions:
1) What's normally (== without malicious/broken server) seen when you mount an IPC$ share?
2) Does it ever have subdirectories (i.e. can we fail a lookup in its root if it looks like returning a subdirectory)?
3) If it can be non-empty, is there way to ask the server about its contents? Short of "look every possible name up", that is...
As it is, the thing is abusing either cifs_lookup() (if it really shouldn't have any files in it) or dcache_readdir(). Sure, dcache_readdir() can (and should) pin a dentry while copying the name to userland, but WTF kind of semantics it is? "ls will return the things you'd guessed to look up, until there's enough memory pressure to have them forgotten, which can happen at any point with no activity on server"?
Your questions are interesting and rarely asked.
On Fri, Oct 4, 2019 at 11:57 AM Al Viro viro@zeniv.linux.org.uk wrote:
On Fri, Oct 04, 2019 at 05:02:20PM +0100, Al Viro wrote:
* (possibly) cifs hitting the same on eviction by memory pressure alone
(no locked inodes anywhere in sight). Possibly == if cifs IPC$ share happens to show up non-empty (e.g. due to server playing silly buggers). * (possibly) cifs hitting *another* lovely issue - lookup in one subdirectory of IPC$ root finding an alias for another subdirectory of said root, triggering d_move() of dentry of the latter. IF the name happens to be long enough to be externally allocated and if dcache_readdir() on root is currently copying it to userland, Bad Things(tm) will happen. That one almost certainly depends upon the server playing silly buggers and might or might not be possible. I'm not familiar enough with CIFS to tell.
BTW, I would really appreciate somebody familiar with CIFS giving a braindump on that. Questions:
- What's normally (== without malicious/broken server) seen when you mount
an IPC$ share?
IPC$ is for "inter process communication" so is basically an abstraction for named pipes (used for remote procedure call queries using the old DCE/RPC standard).
To Windows it is possible to mount IPC$, to Samba you can connect to the share but due to a Samba server bug you can't do a query info on "." (the 'root' of the IPC$ share).
- Does it ever have subdirectories (i.e. can we fail a lookup in its root if it
looks like returning a subdirectory)?
In Samba you can't query subdirectories on IPC$ because even open of "." fails, but to Windows the query directory would get "STATUS_INVALID_INFO_CLASS"
An interesting question, and one that I will bring up with the spec writers is whether there are info level which would be allowed for query directory (probably not).
Another interesting question this brings up is ... "should we allow enumerating the 'services' under IPC$ via readdir"? You could imagine a case where mounting IPC$ would allow you to see the 'services' exported by the server over remote procedure call ("server service" and "workstation server" and "netlogon service" and the global name space (DFS) service and perhaps "witness protocol services" and "branch cache service" etc.)
And then thinking about Dave Howell's changes to the mount API - should this be a mechanism that is allowed to be used to either browse the valid shares or better access the root of the (DFS) global name space.
But the short answer is "no you can't query the directory contents under IPC$" (at least not without changing the abstraction that we export on the client) but I am open to ideas if this would fit with Dave Howell's changes to the mount API or other ideas.
- If it can be non-empty, is there way to ask the server about its contents?
Short of "look every possible name up", that is...
As it is, the thing is abusing either cifs_lookup() (if it really shouldn't have any files in it) or dcache_readdir(). Sure, dcache_readdir() can (and should) pin a dentry while copying the name to userland, but WTF kind of semantics it is? "ls will return the things you'd guessed to look up, until there's enough memory pressure to have them forgotten, which can happen at any point with no activity on server"?
Server's realistically must expose a share "IPC$" so in theory it can be mounted (despite Samba server's current bug there) and there were some experiments that Shirish did a few years ago opening well known services under mounts to IPC$ (to allow doing remote procedure calls over SMB3 mounts which has some value) but AFAIK you would never do a readdir over IPC$ and no current users would ever mount IPC$
On Fri, Oct 4, 2019 at 7:27 AM Al Viro viro@zeniv.linux.org.uk wrote:
FWIW, vfs.git#fixes (or #next.dcache) ought to deal with that one.
Dang, I thought this already got merged. But we only discussed it extensively and I guess it got delayed by all the discussions about possible fixes for the d_lock contention.
Al, mind sending me that one - and honestly, I'd take the "cursors off the list at the end" patch too. That may not be stable material, but I still think it's going to help the d_lock contention at least partially in practice.
Linus
On Fri, Oct 4, 2019 at 9:52 AM Linus Torvalds torvalds@linux-foundation.org wrote:
Dang, I thought this already got merged. But we only discussed it extensively and I guess it got delayed by all the discussions about possible fixes for the d_lock contention.
Side note: I'm not all _that_ worried about the d_lock contention thing, simply because the regression report was from an artificial benchmark, and it was on a 144-core machine.
Compared to the Oops, it's not really a thing.
Linus
linux-stable-mirror@lists.linaro.org