[PATCH] ovl: check before dereferencing s_root field

List overview All Threads
Download

newer

older

[PATCH] fs/notify: call...

[PATCH v2] mm/ksm: fix...

Jakub Acs

15 Sep 2025 15 Sep '25

10:15 a.m.

Calling intotify_show_fdinfo() on fd watching an overlayfs inode, while the overlayfs is being unmounted, can lead to dereferencing NULL ptr.

This issue was found by syzkaller.

Race Condition Diagram:

Thread 1 Thread 2 -------- --------

generic_shutdown_super() shrink_dcache_for_umount sb->s_root = NULL

Which then leads to:

[ 32.133461] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI [ 32.134438] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] [ 32.135032] CPU: 1 UID: 0 PID: 4468 Comm: systemd-coredum Not tainted 6.17.0-rc6 #22 PREEMPT(none)

[ 32.143353] Call Trace: [ 32.143732] ovl_encode_fh+0xd5/0x170 [ 32.144031] exportfs_encode_inode_fh+0x12f/0x300 [ 32.144425] show_mark_fhandle+0xbe/0x1f0 [ 32.145805] inotify_fdinfo+0x226/0x2d0 [ 32.146442] inotify_show_fdinfo+0x1c5/0x350 [ 32.147168] seq_show+0x530/0x6f0 [ 32.147449] seq_read_iter+0x503/0x12a0 [ 32.148419] seq_read+0x31f/0x410 [ 32.150714] vfs_read+0x1f0/0x9e0 [ 32.152297] ksys_read+0x125/0x240

IOW ovl_check_encode_origin derefs inode->i_sb->s_root, after it was set to NULL in the unmount path.

Minimize the window of opportunity by adding explicit check.

Fixes: c45beebfde34 ("ovl: support encoding fid from inode with no alias") Signed-off-by: Jakub Acs acsjakub@amazon.de Cc: Miklos Szeredi miklos@szeredi.hu Cc: Amir Goldstein amir73il@gmail.com Cc: linux-unionfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org ---

I'm happy to take suggestions for a better fix - I looked at taking s_umount for reading, but it wasn't clear to me for how long would the fdinfo path need to hold it. Hence the most primitive suggestion in this v1.

I'm also not sure if ENOENT or EBUSY is better?.. or even something else?

fs/overlayfs/export.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb1567..424c73188e06 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) if (!ovl_inode_lower(inode)) return 0;

+ if (!inode->i_sb->s_root) + return -ENOENT; /* * Root is never indexed, so if there's an upper layer, encode upper for * root.

-- 2.47.3 Amazon Web Services Development Center Germany GmbH Tamara-Danz-Str. 13 10243 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597

Show replies by date

Amir Goldstein

15 Sep 15 Sep

1:01 p.m.

On Mon, Sep 15, 2025 at 12:15 PM Jakub Acs acsjakub@amazon.de wrote:

...

Calling intotify_show_fdinfo() on fd watching an overlayfs inode, while the overlayfs is being unmounted, can lead to dereferencing NULL ptr.

This issue was found by syzkaller.

Race Condition Diagram:

Thread 1 Thread 2

generic_shutdown_super() shrink_dcache_for_umount sb->s_root = NULL
                |
                |             vfs_read()
                |              inotify_fdinfo()
                |               * inode get from mark *
                |               show_mark_fhandle(m, inode)
                |                exportfs_encode_fid(inode, ..)
                |                 ovl_encode_fh(inode, ..)
                |                  ovl_check_encode_origin(inode)
                |                   * deref i_sb->s_root *
                |
                |
                v
fsnotify_sb_delete(sb)

Which then leads to:

[ 32.133461] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI [ 32.134438] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] [ 32.135032] CPU: 1 UID: 0 PID: 4468 Comm: systemd-coredum Not tainted 6.17.0-rc6 #22 PREEMPT(none)

<snip registers, unreliable trace>

[ 32.143353] Call Trace: [ 32.143732] ovl_encode_fh+0xd5/0x170 [ 32.144031] exportfs_encode_inode_fh+0x12f/0x300 [ 32.144425] show_mark_fhandle+0xbe/0x1f0 [ 32.145805] inotify_fdinfo+0x226/0x2d0 [ 32.146442] inotify_show_fdinfo+0x1c5/0x350 [ 32.147168] seq_show+0x530/0x6f0 [ 32.147449] seq_read_iter+0x503/0x12a0 [ 32.148419] seq_read+0x31f/0x410 [ 32.150714] vfs_read+0x1f0/0x9e0 [ 32.152297] ksys_read+0x125/0x240

IOW ovl_check_encode_origin derefs inode->i_sb->s_root, after it was set to NULL in the unmount path.

Minimize the window of opportunity by adding explicit check.

Fixes: c45beebfde34 ("ovl: support encoding fid from inode with no alias") Signed-off-by: Jakub Acs acsjakub@amazon.de Cc: Miklos Szeredi miklos@szeredi.hu Cc: Amir Goldstein amir73il@gmail.com Cc: linux-unionfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org

I'm happy to take suggestions for a better fix - I looked at taking s_umount for reading, but it wasn't clear to me for how long would the fdinfo path need to hold it. Hence the most primitive suggestion in this v1.

I'm also not sure if ENOENT or EBUSY is better?.. or even something else?

fs/overlayfs/export.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb1567..424c73188e06 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) if (!ovl_inode_lower(inode)) return 0;
  if (!inode->i_sb->s_root)
          return -ENOENT;

For a filesystem method to have to check that its own root is still alive sounds like the wrong way to me. That's one of the things that should be taken for granted by fs code.

I don't think this is an overlayfs specific issue, because other fs would be happy if encode_fh() would be called with NULL sb->s_root.

Jan,

Can we change the order of generic_shutdown_super() so that fsnotify_sb_delete(sb) is called before setting s_root to NULL?

Or is there a better solution for this race?

Thanks, Amir.

Jan Kara

2:07 p.m.

On Mon 15-09-25 15:01:13, Amir Goldstein wrote:

...

On Mon, Sep 15, 2025 at 12:15 PM Jakub Acs acsjakub@amazon.de wrote:

...
Calling intotify_show_fdinfo() on fd watching an overlayfs inode, while the overlayfs is being unmounted, can lead to dereferencing NULL ptr.

This issue was found by syzkaller.

Race Condition Diagram:

Thread 1 Thread 2

generic_shutdown_super() shrink_dcache_for_umount sb->s_root = NULL
                |
                |             vfs_read()
                |              inotify_fdinfo()
                |               * inode get from mark *
                |               show_mark_fhandle(m, inode)
                |                exportfs_encode_fid(inode, ..)
                |                 ovl_encode_fh(inode, ..)
                |                  ovl_check_encode_origin(inode)
                |                   * deref i_sb->s_root *
                |
                |
                v
fsnotify_sb_delete(sb)

Which then leads to:

[ 32.133461] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI [ 32.134438] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] [ 32.135032] CPU: 1 UID: 0 PID: 4468 Comm: systemd-coredum Not tainted 6.17.0-rc6 #22 PREEMPT(none)

<snip registers, unreliable trace>

[ 32.143353] Call Trace: [ 32.143732] ovl_encode_fh+0xd5/0x170 [ 32.144031] exportfs_encode_inode_fh+0x12f/0x300 [ 32.144425] show_mark_fhandle+0xbe/0x1f0 [ 32.145805] inotify_fdinfo+0x226/0x2d0 [ 32.146442] inotify_show_fdinfo+0x1c5/0x350 [ 32.147168] seq_show+0x530/0x6f0 [ 32.147449] seq_read_iter+0x503/0x12a0 [ 32.148419] seq_read+0x31f/0x410 [ 32.150714] vfs_read+0x1f0/0x9e0 [ 32.152297] ksys_read+0x125/0x240

IOW ovl_check_encode_origin derefs inode->i_sb->s_root, after it was set to NULL in the unmount path.

Minimize the window of opportunity by adding explicit check.

Fixes: c45beebfde34 ("ovl: support encoding fid from inode with no alias") Signed-off-by: Jakub Acs acsjakub@amazon.de Cc: Miklos Szeredi miklos@szeredi.hu Cc: Amir Goldstein amir73il@gmail.com Cc: linux-unionfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org

I'm happy to take suggestions for a better fix - I looked at taking s_umount for reading, but it wasn't clear to me for how long would the fdinfo path need to hold it. Hence the most primitive suggestion in this v1.

I'm also not sure if ENOENT or EBUSY is better?.. or even something else?

fs/overlayfs/export.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb1567..424c73188e06 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) if (!ovl_inode_lower(inode)) return 0;
  if (!inode->i_sb->s_root)
          return -ENOENT;
For a filesystem method to have to check that its own root is still alive sounds like the wrong way to me. That's one of the things that should be taken for granted by fs code.

I don't think this is an overlayfs specific issue, because other fs would be happy if encode_fh() would be called with NULL sb->s_root.

Actually, I don't see where that would blow up? Generally references to sb->s_root in filesystems outside of mount / remount code are pretty rare. Also most of the code should be unreachable by the time we set sb->s_root to NULL because there are no open files at that moment, no exports etc. But as this report shows, there are occasional surprises (I remember similar issue with ext4 sysfs files handlers using s_root without checking couple years back).

...

Jan,

Can we change the order of generic_shutdown_super() so that fsnotify_sb_delete(sb) is called before setting s_root to NULL?

Or is there a better solution for this race?

Regarding calling fsnotify_sb_delete() before setting s_root to NULL: In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after evict_inodes")) we've moved the call after evict_inodes() because otherwise we were just wasting cycles scanning many inodes without watches. So moving it earlier wouldn't be great...

Honza

-- Jan Kara jack@suse.com SUSE Labs, CR

Amir Goldstein

3:29 p.m.

On Mon, Sep 15, 2025 at 4:07 PM Jan Kara jack@suse.cz wrote:

...

On Mon 15-09-25 15:01:13, Amir Goldstein wrote:

...
On Mon, Sep 15, 2025 at 12:15 PM Jakub Acs acsjakub@amazon.de wrote:

...
Calling intotify_show_fdinfo() on fd watching an overlayfs inode, while the overlayfs is being unmounted, can lead to dereferencing NULL ptr.

This issue was found by syzkaller.

Race Condition Diagram:

Thread 1 Thread 2

generic_shutdown_super() shrink_dcache_for_umount sb->s_root = NULL
                |
                |             vfs_read()
                |              inotify_fdinfo()
                |               * inode get from mark *
                |               show_mark_fhandle(m, inode)
                |                exportfs_encode_fid(inode, ..)
                |                 ovl_encode_fh(inode, ..)
                |                  ovl_check_encode_origin(inode)
                |                   * deref i_sb->s_root *
                |
                |
                v
fsnotify_sb_delete(sb)

Which then leads to:

[ 32.133461] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI [ 32.134438] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] [ 32.135032] CPU: 1 UID: 0 PID: 4468 Comm: systemd-coredum Not tainted 6.17.0-rc6 #22 PREEMPT(none)

<snip registers, unreliable trace>

[ 32.143353] Call Trace: [ 32.143732] ovl_encode_fh+0xd5/0x170 [ 32.144031] exportfs_encode_inode_fh+0x12f/0x300 [ 32.144425] show_mark_fhandle+0xbe/0x1f0 [ 32.145805] inotify_fdinfo+0x226/0x2d0 [ 32.146442] inotify_show_fdinfo+0x1c5/0x350 [ 32.147168] seq_show+0x530/0x6f0 [ 32.147449] seq_read_iter+0x503/0x12a0 [ 32.148419] seq_read+0x31f/0x410 [ 32.150714] vfs_read+0x1f0/0x9e0 [ 32.152297] ksys_read+0x125/0x240

IOW ovl_check_encode_origin derefs inode->i_sb->s_root, after it was set to NULL in the unmount path.

Minimize the window of opportunity by adding explicit check.

Fixes: c45beebfde34 ("ovl: support encoding fid from inode with no alias") Signed-off-by: Jakub Acs acsjakub@amazon.de Cc: Miklos Szeredi miklos@szeredi.hu Cc: Amir Goldstein amir73il@gmail.com Cc: linux-unionfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org

I'm happy to take suggestions for a better fix - I looked at taking s_umount for reading, but it wasn't clear to me for how long would the fdinfo path need to hold it. Hence the most primitive suggestion in this v1.

I'm also not sure if ENOENT or EBUSY is better?.. or even something else?

fs/overlayfs/export.c | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb1567..424c73188e06 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) if (!ovl_inode_lower(inode)) return 0;
  if (!inode->i_sb->s_root)
          return -ENOENT;
For a filesystem method to have to check that its own root is still alive sounds like the wrong way to me. That's one of the things that should be taken for granted by fs code.

I don't think this is an overlayfs specific issue, because other fs would be happy if encode_fh() would be called with NULL sb->s_root.
Actually, I don't see where that would blow up? Generally references to sb->s_root in filesystems outside of mount / remount code are pretty rare. Also most of the code should be unreachable by the time we set sb->s_root to NULL because there are no open files at that moment, no exports etc. But as this report shows, there are occasional surprises (I remember similar issue with ext4 sysfs files handlers using s_root without checking couple years back).

I am not sure that I understand what you are arguing for. I did a very naive grep s_root fs/*/export.c and quickly found:

static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, struct inode *parent) { ... if (!parent || inode == d_inode(sb->s_root)) return *len;

So it's not an overlayfs specific issue, just so happens that zysbot likes to test overlayfs.

Are you suggesting that we fix all of those one by one?

...

...
Jan,

Can we change the order of generic_shutdown_super() so that fsnotify_sb_delete(sb) is called before setting s_root to NULL?

Or is there a better solution for this race?

Regarding calling fsnotify_sb_delete() before setting s_root to NULL: In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after evict_inodes")) we've moved the call after evict_inodes() because otherwise we were just wasting cycles scanning many inodes without watches. So moving it earlier wouldn't be great...

Yes, I noticed that and I figured there were subtleties.

In any case, Jakub, your patch is insufficient because: 1. Checking sb->sb_root without a lock and without READ_ONCE() and a matching WRITE_ONCE() is not safe 2. sb_root can become NULL after the check since you are not holding the s_umount lock

Jakub,

Instead of an unsafe check inside ovl_encode_fh(), I think it is better to use super_trylock_shared() inside show_mark_fhandle() before calling exportfs_encode_fid()?

Feels like the corner case is show_mark_fhandle() and there is no strong incentive to make this code very efficient.

Jan, WDYT?

Thanks, Amir.

Jan Kara

16 Sep 16 Sep

11:30 a.m.

On Mon 15-09-25 17:29:40, Amir Goldstein wrote:

...

On Mon, Sep 15, 2025 at 4:07 PM Jan Kara jack@suse.cz wrote:

...
...
...
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb1567..424c73188e06 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) if (!ovl_inode_lower(inode)) return 0;
  if (!inode->i_sb->s_root)
          return -ENOENT;
For a filesystem method to have to check that its own root is still alive sounds like the wrong way to me. That's one of the things that should be taken for granted by fs code.

I don't think this is an overlayfs specific issue, because other fs would be happy if encode_fh() would be called with NULL sb->s_root.
Actually, I don't see where that would blow up? Generally references to sb->s_root in filesystems outside of mount / remount code are pretty rare. Also most of the code should be unreachable by the time we set sb->s_root to NULL because there are no open files at that moment, no exports etc. But as this report shows, there are occasional surprises (I remember similar issue with ext4 sysfs files handlers using s_root without checking couple years back).
I am not sure that I understand what you are arguing for. I did a very naive grep s_root fs/*/export.c and quickly found:

You're better with grep than me ;). I was grepping for '->s_root' as well but all the hits I had looked into were related to mounting and similar and eventually I got bored. Restricting the grep to export ops indeed shows ceph, gfs2 and overlayfs are vulnerable to this kind of problem.

...

static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, struct inode *parent) { ... if (!parent || inode == d_inode(sb->s_root)) return *len;

So it's not an overlayfs specific issue, just so happens that zysbot likes to test overlayfs.

Are you suggesting that we fix all of those one by one?

No. I agree we need to figure out a way to make sure export ops are not called on a filesystem being unmounted. Standard open_by_handle() or NFS export cannot race with generic_shutdown_super() (they hold the fs mounted) so fsnotify is a special case here.

I actually wonder if fanotify event (e.g. from inode deletion postponed to some workqueue or whatever) cannot race with umount as well and cause the same problem...

...

...
...
Can we change the order of generic_shutdown_super() so that fsnotify_sb_delete(sb) is called before setting s_root to NULL?

Or is there a better solution for this race?

Regarding calling fsnotify_sb_delete() before setting s_root to NULL: In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after evict_inodes")) we've moved the call after evict_inodes() because otherwise we were just wasting cycles scanning many inodes without watches. So moving it earlier wouldn't be great...

Yes, I noticed that and I figured there were subtleties.

Right. After thinking more about it I think calling fsnotify_sb_delete() earlier is the only practical choice we have (not clearing sb->s_root isn't much of an option - we need to prune all dentries to quiesce the filesystem and leaving s_root alive would create odd corner cases). But you don't want to be iterating millions of inodes just to clear couple of marks so we'll have to figure out something more clever there.

Honza

-- Jan Kara jack@suse.com SUSE Labs, CR

Amir Goldstein

1:29 p.m.

On Tue, Sep 16, 2025 at 1:30 PM Jan Kara jack@suse.cz wrote:

...

On Mon 15-09-25 17:29:40, Amir Goldstein wrote:

...
On Mon, Sep 15, 2025 at 4:07 PM Jan Kara jack@suse.cz wrote:

...
...
...
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb1567..424c73188e06 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) if (!ovl_inode_lower(inode)) return 0;
  if (!inode->i_sb->s_root)
          return -ENOENT;
For a filesystem method to have to check that its own root is still alive sounds like the wrong way to me. That's one of the things that should be taken for granted by fs code.

I don't think this is an overlayfs specific issue, because other fs would be happy if encode_fh() would be called with NULL sb->s_root.
Actually, I don't see where that would blow up? Generally references to sb->s_root in filesystems outside of mount / remount code are pretty rare. Also most of the code should be unreachable by the time we set sb->s_root to NULL because there are no open files at that moment, no exports etc. But as this report shows, there are occasional surprises (I remember similar issue with ext4 sysfs files handlers using s_root without checking couple years back).
I am not sure that I understand what you are arguing for. I did a very naive grep s_root fs/*/export.c and quickly found:
You're better with grep than me ;). I was grepping for '->s_root' as well but all the hits I had looked into were related to mounting and similar and eventually I got bored. Restricting the grep to export ops indeed shows ceph, gfs2 and overlayfs are vulnerable to this kind of problem.

...
static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, struct inode *parent) { ... if (!parent || inode == d_inode(sb->s_root)) return *len;

So it's not an overlayfs specific issue, just so happens that zysbot likes to test overlayfs.

Are you suggesting that we fix all of those one by one?

No. I agree we need to figure out a way to make sure export ops are not called on a filesystem being unmounted. Standard open_by_handle() or NFS export cannot race with generic_shutdown_super() (they hold the fs mounted) so fsnotify is a special case here.

I actually wonder if fanotify event (e.g. from inode deletion postponed to some workqueue or whatever) cannot race with umount as well and cause the same problem...

Oy. I was thinking that all event happen when holding some mnt ref but yeh fsnotify_inoderemove() does look like it could be a problem from sb shutdown context.

How about skipping fsnotify_inoderemove() in case sb is in shutdown?

...

...
...
...
Can we change the order of generic_shutdown_super() so that fsnotify_sb_delete(sb) is called before setting s_root to NULL?

Or is there a better solution for this race?

Regarding calling fsnotify_sb_delete() before setting s_root to NULL: In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after evict_inodes")) we've moved the call after evict_inodes() because otherwise we were just wasting cycles scanning many inodes without watches. So moving it earlier wouldn't be great...

Yes, I noticed that and I figured there were subtleties.

Right. After thinking more about it I think calling fsnotify_sb_delete() earlier is the only practical choice we have (not clearing sb->s_root isn't much of an option - we need to prune all dentries to quiesce the filesystem and leaving s_root alive would create odd corner cases). But you don't want to be iterating millions of inodes just to clear couple of marks so we'll have to figure out something more clever there.

I think we only need to suppress the fsnotify_inoderemove() call. It sounds doable and very local to fs/super.c.

Regarding show_mark_fhandle() WDYT about my suggestion to guard it with super_trylock_shared()?

Thanks, Amir.

Jan Kara

17 Sep 17 Sep

9:25 a.m.

On Tue 16-09-25 15:29:35, Amir Goldstein wrote:

...

On Tue, Sep 16, 2025 at 1:30 PM Jan Kara jack@suse.cz wrote:

...
On Mon 15-09-25 17:29:40, Amir Goldstein wrote:

...
On Mon, Sep 15, 2025 at 4:07 PM Jan Kara jack@suse.cz wrote:

...
...
...
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb1567..424c73188e06 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) if (!ovl_inode_lower(inode)) return 0;
  if (!inode->i_sb->s_root)
          return -ENOENT;
For a filesystem method to have to check that its own root is still alive sounds like the wrong way to me. That's one of the things that should be taken for granted by fs code.

I don't think this is an overlayfs specific issue, because other fs would be happy if encode_fh() would be called with NULL sb->s_root.
Actually, I don't see where that would blow up? Generally references to sb->s_root in filesystems outside of mount / remount code are pretty rare. Also most of the code should be unreachable by the time we set sb->s_root to NULL because there are no open files at that moment, no exports etc. But as this report shows, there are occasional surprises (I remember similar issue with ext4 sysfs files handlers using s_root without checking couple years back).
I am not sure that I understand what you are arguing for. I did a very naive grep s_root fs/*/export.c and quickly found:
You're better with grep than me ;). I was grepping for '->s_root' as well but all the hits I had looked into were related to mounting and similar and eventually I got bored. Restricting the grep to export ops indeed shows ceph, gfs2 and overlayfs are vulnerable to this kind of problem.

...
static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, struct inode *parent) { ... if (!parent || inode == d_inode(sb->s_root)) return *len;

So it's not an overlayfs specific issue, just so happens that zysbot likes to test overlayfs.

Are you suggesting that we fix all of those one by one?

No. I agree we need to figure out a way to make sure export ops are not called on a filesystem being unmounted. Standard open_by_handle() or NFS export cannot race with generic_shutdown_super() (they hold the fs mounted) so fsnotify is a special case here.

I actually wonder if fanotify event (e.g. from inode deletion postponed to some workqueue or whatever) cannot race with umount as well and cause the same problem...
Oy. I was thinking that all event happen when holding some mnt ref but yeh fsnotify_inoderemove() does look like it could be a problem from sb shutdown context.

Well, but there's also fun like fs/kernfs/file.c: kernfs_notify() which queues work which calls fsnotify for some inodes and, frankly, proper exclusion with umount seems non-existent there (but I can be missing something).

Also we have fsnotify_sb_error() which can happen practically anytime before the fs gets fully shutdown in ->kill_sb() and may try to encode fh of an inode.

So there are not many cases where this can happen but enough that I'd say that handling some events specially to avoid encoding fh on fs while it is unmounted is fragile and prone to breaking again sooner or later.

...

How about skipping fsnotify_inoderemove() in case sb is in shutdown?

Also how would you like to handle that in a race-free manner? We'd need to hold s_umount for that which we cannot really afford in that context. But maybe you have some better idea...

...

...
...
...
...
Can we change the order of generic_shutdown_super() so that fsnotify_sb_delete(sb) is called before setting s_root to NULL?

Or is there a better solution for this race?

Regarding calling fsnotify_sb_delete() before setting s_root to NULL: In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after evict_inodes")) we've moved the call after evict_inodes() because otherwise we were just wasting cycles scanning many inodes without watches. So moving it earlier wouldn't be great...

Yes, I noticed that and I figured there were subtleties.

Right. After thinking more about it I think calling fsnotify_sb_delete() earlier is the only practical choice we have (not clearing sb->s_root isn't much of an option - we need to prune all dentries to quiesce the filesystem and leaving s_root alive would create odd corner cases). But you don't want to be iterating millions of inodes just to clear couple of marks so we'll have to figure out something more clever there.

I think we only need to suppress the fsnotify_inoderemove() call. It sounds doable and very local to fs/super.c.

Regarding show_mark_fhandle() WDYT about my suggestion to guard it with super_trylock_shared()?

Yes, super_trylock_shared() for that callsite looks like a fine solution for that call site. Occasional random failures in encoding fh because the trylock fails are unlikely to have any bad consequences there. But I think we need to figure out other possibly racing call-sites as well first.

Honza

-- Jan Kara jack@suse.com SUSE Labs, CR

Amir Goldstein

11:07 a.m.

On Wed, Sep 17, 2025 at 11:25 AM Jan Kara jack@suse.cz wrote:

...

On Tue 16-09-25 15:29:35, Amir Goldstein wrote:

...
On Tue, Sep 16, 2025 at 1:30 PM Jan Kara jack@suse.cz wrote:

...
On Mon 15-09-25 17:29:40, Amir Goldstein wrote:

...
On Mon, Sep 15, 2025 at 4:07 PM Jan Kara jack@suse.cz wrote:

...
...
> diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c > index 83f80fdb1567..424c73188e06 100644 > --- a/fs/overlayfs/export.c > +++ b/fs/overlayfs/export.c > @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) > if (!ovl_inode_lower(inode)) > return 0; > > + if (!inode->i_sb->s_root) > + return -ENOENT;

For a filesystem method to have to check that its own root is still alive sounds like the wrong way to me. That's one of the things that should be taken for granted by fs code.

I don't think this is an overlayfs specific issue, because other fs would be happy if encode_fh() would be called with NULL sb->s_root.

Actually, I don't see where that would blow up? Generally references to sb->s_root in filesystems outside of mount / remount code are pretty rare. Also most of the code should be unreachable by the time we set sb->s_root to NULL because there are no open files at that moment, no exports etc. But as this report shows, there are occasional surprises (I remember similar issue with ext4 sysfs files handlers using s_root without checking couple years back).

I am not sure that I understand what you are arguing for. I did a very naive grep s_root fs/*/export.c and quickly found:

You're better with grep than me ;). I was grepping for '->s_root' as well but all the hits I had looked into were related to mounting and similar and eventually I got bored. Restricting the grep to export ops indeed shows ceph, gfs2 and overlayfs are vulnerable to this kind of problem.

As far as I can tell, ceph uses s_root only in decode_fh methods.

ovl and gfs2 only want to know for an inode if it is the root inode, they do not strictly need to dereference s_root for that purpose. (see patch below)

...

...
...
...
static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, struct inode *parent) { ... if (!parent || inode == d_inode(sb->s_root)) return *len;

So it's not an overlayfs specific issue, just so happens that zysbot likes to test overlayfs.

Are you suggesting that we fix all of those one by one?

No. I agree we need to figure out a way to make sure export ops are not called on a filesystem being unmounted. Standard open_by_handle() or NFS export cannot race with generic_shutdown_super() (they hold the fs mounted) so fsnotify is a special case here.

I actually wonder if fanotify event (e.g. from inode deletion postponed to some workqueue or whatever) cannot race with umount as well and cause the same problem...

Oy. I was thinking that all event happen when holding some mnt ref but yeh fsnotify_inoderemove() does look like it could be a problem from sb shutdown context.

Well, but there's also fun like fs/kernfs/file.c: kernfs_notify() which queues work which calls fsnotify for some inodes and, frankly, proper exclusion with umount seems non-existent there (but I can be missing something).

Ouch!

...

Also we have fsnotify_sb_error() which can happen practically anytime before the fs gets fully shutdown in ->kill_sb() and may try to encode fh of an inode.

Bigger ouch because silencing this event is not an option.

...

So there are not many cases where this can happen but enough that I'd say that handling some events specially to avoid encoding fh on fs while it is unmounted is fragile and prone to breaking again sooner or later.

...
How about skipping fsnotify_inoderemove() in case sb is in shutdown?

Also how would you like to handle that in a race-free manner? We'd need to hold s_umount for that which we cannot really afford in that context. But maybe you have some better idea...

I was only thinking about this code path:

generic_shutdown_super() shrink_dcache_for_umount() ... __dentry_kill() dentry_unlink_inode()

This is supposed to be the last dput of all remaining dentries and I don't think a deferred unlink should be expected in that case.

But I realize now that you mean delayed unlink from another context which races with shutdown.

...

...
...
...
...
...
Can we change the order of generic_shutdown_super() so that fsnotify_sb_delete(sb) is called before setting s_root to NULL?

Or is there a better solution for this race?

Regarding calling fsnotify_sb_delete() before setting s_root to NULL: In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after evict_inodes")) we've moved the call after evict_inodes() because otherwise we were just wasting cycles scanning many inodes without watches. So moving it earlier wouldn't be great...

Yes, I noticed that and I figured there were subtleties.

Right. After thinking more about it I think calling fsnotify_sb_delete() earlier is the only practical choice we have (not clearing sb->s_root isn't much of an option - we need to prune all dentries to quiesce the filesystem and leaving s_root alive would create odd corner cases). But you don't want to be iterating millions of inodes just to clear couple of marks so we'll have to figure out something more clever there.

I think we only need to suppress the fsnotify_inoderemove() call. It sounds doable and very local to fs/super.c.

Regarding show_mark_fhandle() WDYT about my suggestion to guard it with super_trylock_shared()?

Yes, super_trylock_shared() for that callsite looks like a fine solution for that call site. Occasional random failures in encoding fh because the trylock fails are unlikely to have any bad consequences there. But I think we need to figure out other possibly racing call-sites as well first.

Might something naive as this be enough?

Thanks, Amir.

diff --git a/fs/dcache.c b/fs/dcache.c index 60046ae23d514..8c9d0d6bb0045 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1999,10 +1999,12 @@ struct dentry *d_make_root(struct inode *root_inode)

if (root_inode) { res = d_alloc_anon(root_inode->i_sb); - if (res) + if (res) { + root_inode->i_opflags |= IOP_ROOT; d_instantiate(res, root_inode); - else + } else { iput(root_inode); + } } return res; } diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c index 3334c394ce9cb..809a09c6a89e0 100644 --- a/fs/gfs2/export.c +++ b/fs/gfs2/export.c @@ -46,7 +46,7 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, fh[3] = cpu_to_be32(ip->i_no_addr & 0xFFFFFFFF); *len = GFS2_SMALL_FH_SIZE;

- if (!parent || inode == d_inode(sb->s_root)) + if (!parent || is_root_inode(inode)) return *len;

ip = GFS2_I(parent); diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb15674..7827c63354ad5 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -199,7 +199,7 @@ static int ovl_check_encode_origin(struct inode *inode) * Root is never indexed, so if there's an upper layer, encode upper for * root. */ - if (inode == d_inode(inode->i_sb->s_root)) + if (is_root_inode(inode)) return 0;

/* diff --git a/include/linux/fs.h b/include/linux/fs.h index ec867f112fd5f..ed84379aa06ca 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -665,6 +665,7 @@ is_uncached_acl(struct posix_acl *acl) #define IOP_DEFAULT_READLINK 0x0010 #define IOP_MGTIME 0x0020 #define IOP_CACHED_LINK 0x0040 +#define IOP_ROOT 0x0080 /* * Keep mostly read-only and often accessed (especially for @@ -2713,6 +2714,11 @@ static inline bool is_mgtime(const struct inode *inode) return inode->i_opflags & IOP_MGTIME; }

+static inline bool is_root_inode(const struct inode *inode) +{ + return inode->i_opflags & IOP_ROOT; +} + extern struct dentry *mount_bdev(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, int (*fill_super)(struct super_block *, void *, int));

Jakub Acs

11:36 a.m.

On Wed, Sep 17, 2025 at 01:07:45PM +0200, Amir Goldstein wrote:

...

Might something naive as this be enough?

Thanks, Amir.

diff --git a/fs/dcache.c b/fs/dcache.c index 60046ae23d514..8c9d0d6bb0045 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1999,10 +1999,12 @@ struct dentry *d_make_root(struct inode *root_inode)
    if (root_inode) {
            res = d_alloc_anon(root_inode->i_sb);
          if (res)
          if (res) {
                  root_inode->i_opflags |= IOP_ROOT;
                  d_instantiate(res, root_inode);
          else
          } else {
                  iput(root_inode);
          }
  }
  return res;
} diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c index 3334c394ce9cb..809a09c6a89e0 100644 --- a/fs/gfs2/export.c +++ b/fs/gfs2/export.c @@ -46,7 +46,7 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, fh[3] = cpu_to_be32(ip->i_no_addr & 0xFFFFFFFF); *len = GFS2_SMALL_FH_SIZE;
  if (!parent || inode == d_inode(sb->s_root))
  if (!parent || is_root_inode(inode))
          return *len;

  ip = GFS2_I(parent);
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb15674..7827c63354ad5 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -199,7 +199,7 @@ static int ovl_check_encode_origin(struct inode *inode) * Root is never indexed, so if there's an upper layer, encode upper for * root. */
  if (inode == d_inode(inode->i_sb->s_root))
  if (is_root_inode(inode))
          return 0;

  /*
diff --git a/include/linux/fs.h b/include/linux/fs.h index ec867f112fd5f..ed84379aa06ca 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -665,6 +665,7 @@ is_uncached_acl(struct posix_acl *acl) #define IOP_DEFAULT_READLINK 0x0010 #define IOP_MGTIME 0x0020 #define IOP_CACHED_LINK 0x0040 +#define IOP_ROOT 0x0080 /*

Keep mostly read-only and often accessed (especially for

@@ -2713,6 +2714,11 @@ static inline bool is_mgtime(const struct inode *inode) return inode->i_opflags & IOP_MGTIME; }

+static inline bool is_root_inode(const struct inode *inode) +{
  return inode->i_opflags & IOP_ROOT;
+}

extern struct dentry *mount_bdev(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, int (*fill_super)(struct super_block *, void *, int));

This would prevent the null-ptr-deref, but the encoding procedure would continue (for non-root inode), potentially reaching other code paths that assume fs is still mounted - could that maybe be a problem?

I had considered similar direction initially, too, but then decided I'm unable to verify the paths and that it's safer to just fail if we detect no root (or cannot take the lock).

Am I thinking wrong?

Jakub

Amazon Web Services Development Center Germany GmbH Tamara-Danz-Str. 13 10243 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597

Jan Kara

2:42 p.m.

On Wed 17-09-25 13:07:45, Amir Goldstein wrote:

...

On Wed, Sep 17, 2025 at 11:25 AM Jan Kara jack@suse.cz wrote:

...
On Tue 16-09-25 15:29:35, Amir Goldstein wrote:

...
On Tue, Sep 16, 2025 at 1:30 PM Jan Kara jack@suse.cz wrote:

...
On Mon 15-09-25 17:29:40, Amir Goldstein wrote:

...
On Mon, Sep 15, 2025 at 4:07 PM Jan Kara jack@suse.cz wrote:

...
> > diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c > > index 83f80fdb1567..424c73188e06 100644 > > --- a/fs/overlayfs/export.c > > +++ b/fs/overlayfs/export.c > > @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) > > if (!ovl_inode_lower(inode)) > > return 0; > > > > + if (!inode->i_sb->s_root) > > + return -ENOENT; > > For a filesystem method to have to check that its own root is still alive sounds > like the wrong way to me. > That's one of the things that should be taken for granted by fs code. > > I don't think this is an overlayfs specific issue, because other fs would be > happy if encode_fh() would be called with NULL sb->s_root.

Actually, I don't see where that would blow up? Generally references to sb->s_root in filesystems outside of mount / remount code are pretty rare. Also most of the code should be unreachable by the time we set sb->s_root to NULL because there are no open files at that moment, no exports etc. But as this report shows, there are occasional surprises (I remember similar issue with ext4 sysfs files handlers using s_root without checking couple years back).

I am not sure that I understand what you are arguing for. I did a very naive grep s_root fs/*/export.c and quickly found:

You're better with grep than me ;). I was grepping for '->s_root' as well but all the hits I had looked into were related to mounting and similar and eventually I got bored. Restricting the grep to export ops indeed shows ceph, gfs2 and overlayfs are vulnerable to this kind of problem.

As far as I can tell, ceph uses s_root only in decode_fh methods.

True. But ceph also uses d_find_alias() in ceph_encode_snapfh() which could race with shrink_dcache_for_umount()->do_one_tree() and trigger:

WARN(1, "BUG: Dentry %p{i=%lx,n=%pd} " " still in use (%d) [unmount of %s %s]\n",

...

ovl and gfs2 only want to know for an inode if it is the root inode, they do not strictly need to dereference s_root for that purpose. (see patch below)

...
So there are not many cases where this can happen but enough that I'd say that handling some events specially to avoid encoding fh on fs while it is unmounted is fragile and prone to breaking again sooner or later.

...
How about skipping fsnotify_inoderemove() in case sb is in shutdown?

Also how would you like to handle that in a race-free manner? We'd need to hold s_umount for that which we cannot really afford in that context. But maybe you have some better idea...

I was only thinking about this code path:

generic_shutdown_super() shrink_dcache_for_umount() ... __dentry_kill() dentry_unlink_inode()

This is supposed to be the last dput of all remaining dentries and I don't think a deferred unlink should be expected in that case.

I see.

...

But I realize now that you mean delayed unlink from another context which races with shutdown.

Yes, I've meant that.

...

...
...
...
...
...
> Can we change the order of generic_shutdown_super() so that > fsnotify_sb_delete(sb) is called before setting s_root to NULL? > > Or is there a better solution for this race?

Regarding calling fsnotify_sb_delete() before setting s_root to NULL: In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after evict_inodes")) we've moved the call after evict_inodes() because otherwise we were just wasting cycles scanning many inodes without watches. So moving it earlier wouldn't be great...

Yes, I noticed that and I figured there were subtleties.

Right. After thinking more about it I think calling fsnotify_sb_delete() earlier is the only practical choice we have (not clearing sb->s_root isn't much of an option - we need to prune all dentries to quiesce the filesystem and leaving s_root alive would create odd corner cases). But you don't want to be iterating millions of inodes just to clear couple of marks so we'll have to figure out something more clever there.

I think we only need to suppress the fsnotify_inoderemove() call. It sounds doable and very local to fs/super.c.

Regarding show_mark_fhandle() WDYT about my suggestion to guard it with super_trylock_shared()?

Yes, super_trylock_shared() for that callsite looks like a fine solution for that call site. Occasional random failures in encoding fh because the trylock fails are unlikely to have any bad consequences there. But I think we need to figure out other possibly racing call-sites as well first.

Might something naive as this be enough?

It looks like it should be good for the problems with gfs2 & overlayfs but it doesn't solve the problem with ceph and as Jakub writes there's a question whether we won't hit more problems later.

I'm sorry for poking holes into your solutions. The more I look into this the more problems I find :-|

As I'm thinking about it I'm slowly leaning towards implementing a list of connectors per sb (so that we can quickly reclaim on umount). It seems stupid just for these corner cases but longer term we can also implement what Dave once suggested [1] so that fsnotify doesn't need to pin inodes in memory at all which should more that make up for the additional memory for inode connector members.

Honza

[1] https://lore.kernel.org/linux-fsdevel/ZwXDzKGj6Bp28kYe@dread.disaster.area/

...

diff --git a/fs/dcache.c b/fs/dcache.c index 60046ae23d514..8c9d0d6bb0045 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1999,10 +1999,12 @@ struct dentry *d_make_root(struct inode *root_inode)
    if (root_inode) {
            res = d_alloc_anon(root_inode->i_sb);
          if (res)
          if (res) {
                  root_inode->i_opflags |= IOP_ROOT;
                  d_instantiate(res, root_inode);
          else
          } else {
                  iput(root_inode);
          }
  }
  return res;
} diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c index 3334c394ce9cb..809a09c6a89e0 100644 --- a/fs/gfs2/export.c +++ b/fs/gfs2/export.c @@ -46,7 +46,7 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, fh[3] = cpu_to_be32(ip->i_no_addr & 0xFFFFFFFF); *len = GFS2_SMALL_FH_SIZE;
  if (!parent || inode == d_inode(sb->s_root))
  if (!parent || is_root_inode(inode))
          return *len;

  ip = GFS2_I(parent);
diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c index 83f80fdb15674..7827c63354ad5 100644 --- a/fs/overlayfs/export.c +++ b/fs/overlayfs/export.c @@ -199,7 +199,7 @@ static int ovl_check_encode_origin(struct inode *inode) * Root is never indexed, so if there's an upper layer, encode upper for * root. */
  if (inode == d_inode(inode->i_sb->s_root))
  if (is_root_inode(inode))
          return 0;

  /*
diff --git a/include/linux/fs.h b/include/linux/fs.h index ec867f112fd5f..ed84379aa06ca 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -665,6 +665,7 @@ is_uncached_acl(struct posix_acl *acl) #define IOP_DEFAULT_READLINK 0x0010 #define IOP_MGTIME 0x0020 #define IOP_CACHED_LINK 0x0040 +#define IOP_ROOT 0x0080 /*

Keep mostly read-only and often accessed (especially for

@@ -2713,6 +2714,11 @@ static inline bool is_mgtime(const struct inode *inode) return inode->i_opflags & IOP_MGTIME; }

+static inline bool is_root_inode(const struct inode *inode) +{
  return inode->i_opflags & IOP_ROOT;
+}

extern struct dentry *mount_bdev(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, int (*fill_super)(struct super_block *, void *, int));

-- Jan Kara jack@suse.com SUSE Labs, CR

Amir Goldstein

5:51 p.m.

On Wed, Sep 17, 2025 at 4:42 PM Jan Kara jack@suse.cz wrote:

...

On Wed 17-09-25 13:07:45, Amir Goldstein wrote:

...
On Wed, Sep 17, 2025 at 11:25 AM Jan Kara jack@suse.cz wrote:

...
On Tue 16-09-25 15:29:35, Amir Goldstein wrote:

...
On Tue, Sep 16, 2025 at 1:30 PM Jan Kara jack@suse.cz wrote:

...
On Mon 15-09-25 17:29:40, Amir Goldstein wrote:

...
On Mon, Sep 15, 2025 at 4:07 PM Jan Kara jack@suse.cz wrote: > > > diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c > > > index 83f80fdb1567..424c73188e06 100644 > > > --- a/fs/overlayfs/export.c > > > +++ b/fs/overlayfs/export.c > > > @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) > > > if (!ovl_inode_lower(inode)) > > > return 0; > > > > > > + if (!inode->i_sb->s_root) > > > + return -ENOENT; > > > > For a filesystem method to have to check that its own root is still alive sounds > > like the wrong way to me. > > That's one of the things that should be taken for granted by fs code. > > > > I don't think this is an overlayfs specific issue, because other fs would be > > happy if encode_fh() would be called with NULL sb->s_root. > > Actually, I don't see where that would blow up? Generally references to > sb->s_root in filesystems outside of mount / remount code are pretty rare. > Also most of the code should be unreachable by the time we set sb->s_root > to NULL because there are no open files at that moment, no exports etc. But > as this report shows, there are occasional surprises (I remember similar > issue with ext4 sysfs files handlers using s_root without checking couple > years back). >

I am not sure that I understand what you are arguing for. I did a very naive grep s_root fs/*/export.c and quickly found:

You're better with grep than me ;). I was grepping for '->s_root' as well but all the hits I had looked into were related to mounting and similar and eventually I got bored. Restricting the grep to export ops indeed shows ceph, gfs2 and overlayfs are vulnerable to this kind of problem.

As far as I can tell, ceph uses s_root only in decode_fh methods.

True. But ceph also uses d_find_alias() in ceph_encode_snapfh() which could race with shrink_dcache_for_umount()->do_one_tree() and trigger:
    WARN(1, "BUG: Dentry %p{i=%lx,n=%pd} "
                    " still in use (%d) [unmount of %s %s]\n",
...
ovl and gfs2 only want to know for an inode if it is the root inode, they do not strictly need to dereference s_root for that purpose. (see patch below)

...
So there are not many cases where this can happen but enough that I'd say that handling some events specially to avoid encoding fh on fs while it is unmounted is fragile and prone to breaking again sooner or later.

...
How about skipping fsnotify_inoderemove() in case sb is in shutdown?

Also how would you like to handle that in a race-free manner? We'd need to hold s_umount for that which we cannot really afford in that context. But maybe you have some better idea...

I was only thinking about this code path:

generic_shutdown_super() shrink_dcache_for_umount() ... __dentry_kill() dentry_unlink_inode()

This is supposed to be the last dput of all remaining dentries and I don't think a deferred unlink should be expected in that case.

I see.

...
But I realize now that you mean delayed unlink from another context which races with shutdown.

Yes, I've meant that.

...
...
...
...
...
> > Can we change the order of generic_shutdown_super() so that > > fsnotify_sb_delete(sb) is called before setting s_root to NULL? > > > > Or is there a better solution for this race? > > Regarding calling fsnotify_sb_delete() before setting s_root to NULL: > In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after > evict_inodes")) we've moved the call after evict_inodes() because otherwise > we were just wasting cycles scanning many inodes without watches. So moving > it earlier wouldn't be great...

Yes, I noticed that and I figured there were subtleties.

Right. After thinking more about it I think calling fsnotify_sb_delete() earlier is the only practical choice we have (not clearing sb->s_root isn't much of an option - we need to prune all dentries to quiesce the filesystem and leaving s_root alive would create odd corner cases). But you don't want to be iterating millions of inodes just to clear couple of marks so we'll have to figure out something more clever there.

I think we only need to suppress the fsnotify_inoderemove() call. It sounds doable and very local to fs/super.c.

Regarding show_mark_fhandle() WDYT about my suggestion to guard it with super_trylock_shared()?

Yes, super_trylock_shared() for that callsite looks like a fine solution for that call site. Occasional random failures in encoding fh because the trylock fails are unlikely to have any bad consequences there. But I think we need to figure out other possibly racing call-sites as well first.

Might something naive as this be enough?

It looks like it should be good for the problems with gfs2 & overlayfs but it doesn't solve the problem with ceph and as Jakub writes there's a question whether we won't hit more problems later.

I'm sorry for poking holes into your solutions. The more I look into this the more problems I find :-|

On the contrary, Thank you for shooting down my bad ideas ;)

...

As I'm thinking about it I'm slowly leaning towards implementing a list of connectors per sb (so that we can quickly reclaim on umount). It seems stupid just for these corner cases but longer term we can also implement what Dave once suggested [1] so that fsnotify doesn't need to pin inodes in memory at all which should more that make up for the additional memory for inode connector members.
                                                            Honza
[1] https://lore.kernel.org/linux-fsdevel/ZwXDzKGj6Bp28kYe@dread.disaster.area/

Interesting. I'll wait for you to think this over. If you think that it might take some time, maybe we should apply the super_trylock_shared() band aid to show_mark_fhandle() in the meantime. Whatever you think is right.

Thanks, Amir.

Jakub Acs

1 Oct 1 Oct

10:14 a.m.

On Wed, Sep 17, 2025 at 07:51:25PM +0200, Amir Goldstein wrote:

...

On Wed, Sep 17, 2025 at 4:42 PM Jan Kara jack@suse.cz wrote:

...
On Wed 17-09-25 13:07:45, Amir Goldstein wrote:

...
On Wed, Sep 17, 2025 at 11:25 AM Jan Kara jack@suse.cz wrote:

...
On Tue 16-09-25 15:29:35, Amir Goldstein wrote:

...
On Tue, Sep 16, 2025 at 1:30 PM Jan Kara jack@suse.cz wrote:

...
On Mon 15-09-25 17:29:40, Amir Goldstein wrote: > On Mon, Sep 15, 2025 at 4:07 PM Jan Kara jack@suse.cz wrote: > > > > diff --git a/fs/overlayfs/export.c b/fs/overlayfs/export.c > > > > index 83f80fdb1567..424c73188e06 100644 > > > > --- a/fs/overlayfs/export.c > > > > +++ b/fs/overlayfs/export.c > > > > @@ -195,6 +195,8 @@ static int ovl_check_encode_origin(struct inode *inode) > > > > if (!ovl_inode_lower(inode)) > > > > return 0; > > > > > > > > + if (!inode->i_sb->s_root) > > > > + return -ENOENT; > > > > > > For a filesystem method to have to check that its own root is still alive sounds > > > like the wrong way to me. > > > That's one of the things that should be taken for granted by fs code. > > > > > > I don't think this is an overlayfs specific issue, because other fs would be > > > happy if encode_fh() would be called with NULL sb->s_root. > > > > Actually, I don't see where that would blow up? Generally references to > > sb->s_root in filesystems outside of mount / remount code are pretty rare. > > Also most of the code should be unreachable by the time we set sb->s_root > > to NULL because there are no open files at that moment, no exports etc. But > > as this report shows, there are occasional surprises (I remember similar > > issue with ext4 sysfs files handlers using s_root without checking couple > > years back). > > > > I am not sure that I understand what you are arguing for. > I did a very naive grep s_root fs/*/export.c and quickly found:

You're better with grep than me ;). I was grepping for '->s_root' as well but all the hits I had looked into were related to mounting and similar and eventually I got bored. Restricting the grep to export ops indeed shows ceph, gfs2 and overlayfs are vulnerable to this kind of problem.

As far as I can tell, ceph uses s_root only in decode_fh methods.

True. But ceph also uses d_find_alias() in ceph_encode_snapfh() which could race with shrink_dcache_for_umount()->do_one_tree() and trigger:
    WARN(1, "BUG: Dentry %p{i=%lx,n=%pd} "
                    " still in use (%d) [unmount of %s %s]\n",
...
ovl and gfs2 only want to know for an inode if it is the root inode, they do not strictly need to dereference s_root for that purpose. (see patch below)

...
So there are not many cases where this can happen but enough that I'd say that handling some events specially to avoid encoding fh on fs while it is unmounted is fragile and prone to breaking again sooner or later.

...
How about skipping fsnotify_inoderemove() in case sb is in shutdown?

Also how would you like to handle that in a race-free manner? We'd need to hold s_umount for that which we cannot really afford in that context. But maybe you have some better idea...

I was only thinking about this code path:

generic_shutdown_super() shrink_dcache_for_umount() ... __dentry_kill() dentry_unlink_inode()

This is supposed to be the last dput of all remaining dentries and I don't think a deferred unlink should be expected in that case.

I see.

...
But I realize now that you mean delayed unlink from another context which races with shutdown.

Yes, I've meant that.

...
...
...
...
> > > Can we change the order of generic_shutdown_super() so that > > > fsnotify_sb_delete(sb) is called before setting s_root to NULL? > > > > > > Or is there a better solution for this race? > > > > Regarding calling fsnotify_sb_delete() before setting s_root to NULL: > > In 2019 (commit 1edc8eb2e9313 ("fs: call fsnotify_sb_delete after > > evict_inodes")) we've moved the call after evict_inodes() because otherwise > > we were just wasting cycles scanning many inodes without watches. So moving > > it earlier wouldn't be great... > > Yes, I noticed that and I figured there were subtleties.

Right. After thinking more about it I think calling fsnotify_sb_delete() earlier is the only practical choice we have (not clearing sb->s_root isn't much of an option - we need to prune all dentries to quiesce the filesystem and leaving s_root alive would create odd corner cases). But you don't want to be iterating millions of inodes just to clear couple of marks so we'll have to figure out something more clever there.

I think we only need to suppress the fsnotify_inoderemove() call. It sounds doable and very local to fs/super.c.

Regarding show_mark_fhandle() WDYT about my suggestion to guard it with super_trylock_shared()?

Yes, super_trylock_shared() for that callsite looks like a fine solution for that call site. Occasional random failures in encoding fh because the trylock fails are unlikely to have any bad consequences there. But I think we need to figure out other possibly racing call-sites as well first.

Might something naive as this be enough?

It looks like it should be good for the problems with gfs2 & overlayfs but it doesn't solve the problem with ceph and as Jakub writes there's a question whether we won't hit more problems later.

I'm sorry for poking holes into your solutions. The more I look into this the more problems I find :-|
On the contrary, Thank you for shooting down my bad ideas ;)

...
As I'm thinking about it I'm slowly leaning towards implementing a list of connectors per sb (so that we can quickly reclaim on umount). It seems stupid just for these corner cases but longer term we can also implement what Dave once suggested [1] so that fsnotify doesn't need to pin inodes in memory at all which should more that make up for the additional memory for inode connector members.
                                                            Honza
[1] https://lore.kernel.org/linux-fsdevel/ZwXDzKGj6Bp28kYe@dread.disaster.area/
Interesting. I'll wait for you to think this over. If you think that it might take some time, maybe we should apply the super_trylock_shared() band aid to show_mark_fhandle() in the meantime. Whatever you think is right.

Thanks, Amir.

As the discussion came to a halt here, I decided to send the suggested band-aid: https://lore.kernel.org/all/20251001100955.59634-1-acsjakub@amazon.de/

Kind Regards,

Jakub Acs

Amazon Web Services Development Center Germany GmbH Tamara-Danz-Str. 13 10243 Berlin Geschaeftsfuehrung: Christian Schlaeger Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597

Al Viro

17 Sep 17 Sep

8:42 p.m.

On Wed, Sep 17, 2025 at 01:07:45PM +0200, Amir Goldstein wrote:

...

diff --git a/fs/dcache.c b/fs/dcache.c index 60046ae23d514..8c9d0d6bb0045 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1999,10 +1999,12 @@ struct dentry *d_make_root(struct inode *root_inode)
    if (root_inode) {
            res = d_alloc_anon(root_inode->i_sb);
          if (res)
          if (res) {
                  root_inode->i_opflags |= IOP_ROOT;
                  d_instantiate(res, root_inode);

Umm... Not a good idea - if nothing else, root may end up being attached someplace (normal with nfs, for example).

But more fundamentally, once we are into ->kill_sb(), let alone generic_shutdown_super(), nobody should be playing silly buggers with the filesystem. Sure, RCU accesses are possible, but messing around with fhandles? ->s_root is not the only thing that might be no longer there.

What the fuck is fsnotify playing at?

Al Viro

8:46 p.m.

On Wed, Sep 17, 2025 at 09:42:00PM +0100, Al Viro wrote:

...

On Wed, Sep 17, 2025 at 01:07:45PM +0200, Amir Goldstein wrote:

...
diff --git a/fs/dcache.c b/fs/dcache.c index 60046ae23d514..8c9d0d6bb0045 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1999,10 +1999,12 @@ struct dentry *d_make_root(struct inode *root_inode)
    if (root_inode) {
            res = d_alloc_anon(root_inode->i_sb);
          if (res)
          if (res) {
                  root_inode->i_opflags |= IOP_ROOT;
                  d_instantiate(res, root_inode);
Umm... Not a good idea - if nothing else, root may end up being attached someplace (normal with nfs, for example).

But more fundamentally, once we are into ->kill_sb(), let alone generic_shutdown_super(), nobody should be playing silly buggers with the filesystem. Sure, RCU accesses are possible, but messing around with fhandles? ->s_root is not the only thing that might be no longer there.

What the fuck is fsnotify playing at?

PS: there is a whole lot of the logics in e.g. shrink_dcache_for_umount() that relies on nobody else messing with dentry tree by that point.

Jan Kara

18 Sep 18 Sep

9:59 a.m.

On Wed 17-09-25 21:42:00, Al Viro wrote:

...

On Wed, Sep 17, 2025 at 01:07:45PM +0200, Amir Goldstein wrote:

...
diff --git a/fs/dcache.c b/fs/dcache.c index 60046ae23d514..8c9d0d6bb0045 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1999,10 +1999,12 @@ struct dentry *d_make_root(struct inode *root_inode)
    if (root_inode) {
            res = d_alloc_anon(root_inode->i_sb);
          if (res)
          if (res) {
                  root_inode->i_opflags |= IOP_ROOT;
                  d_instantiate(res, root_inode);
Umm... Not a good idea - if nothing else, root may end up being attached someplace (normal with nfs, for example).

But more fundamentally, once we are into ->kill_sb(), let alone generic_shutdown_super(), nobody should be playing silly buggers with the filesystem. Sure, RCU accesses are possible, but messing around with fhandles? ->s_root is not the only thing that might be no longer there.

What the fuck is fsnotify playing at?

The problem is fsnotify marks aren't shutdown until generic_shutdown_super() calls fsnotify_sb_delete(). So until that moment fsnotify can be generating events for the filesystem. Sure, userspace has no longer access to the fs but stuff like delayed inode deletion or other in-kernel users can still result in events being generated and these events may end up creating file handles to report to userspace.

We have already uncovered with Amir quite a few moments how this is broken so I agree that the best solution is to shutdown fsnotify before we call shrink_dcache_for_umount(). The slight problem is this means iterating all inodes in the sb which is costly when you have millions of them (this is the reason why fsnotify_sb_delete() is currently called after evict_inodes()). So it needs more work on fsnotify side...

Honza

-- Jan Kara jack@suse.com SUSE Labs, CR

101

days inactive

117

days old

linux-stable-mirror@lists.linaro.org

14 comments

participants

tags (0)

participants (4)

Al Viro
Amir Goldstein
Jakub Acs
Jan Kara