From: Xiubo Li <xiubli@redhat.com>
There is a race between flushing capsnaps and removing the inode from the 'mdsc->snap_flush_list' list:
Thread A                                 Thread B

ceph_queue_cap_snap()
 -> allocate 'capsnapA'
 -> ihold('&ci->vfs_inode')
 -> add 'capsnapA' to 'ci->i_cap_snaps'
 -> add 'ci' to 'mdsc->snap_flush_list'
 ...

ceph_flush_snaps()
 -> __ceph_flush_snaps()
  -> __send_flush_snap()
                                         handle_cap_flushsnap_ack()
                                          -> iput('&ci->vfs_inode')
                                             this also will release 'ci'
                                         ...
                                         ceph_handle_snap()
                                          -> flush_snaps()
                                           -> iterate 'mdsc->snap_flush_list'
                                           -> get the stale 'ci'
                                           -> remove 'ci' from
                                              'mdsc->snap_flush_list'
                                           -> ihold(&ci->vfs_inode) this
                                              will WARNING
To fix this, remove 'ci' from the 'mdsc->snap_flush_list' list just before calling '__send_flush_snap()', to make sure the flushsnap 'ack' is always handled after 'ci' has been removed from 'snap_flush_list'.
Cc: stable@vger.kernel.org
URL: https://bugzilla.redhat.com/show_bug.cgi?id=2209299
Signed-off-by: Xiubo Li <xiubli@redhat.com>
---
 fs/ceph/caps.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index feabf4cc0c4f..a8f890b3bb9a 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1595,6 +1595,11 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
 
 	dout("__flush_snaps %p session %p\n", inode, session);
 
+	/* we will flush them all; remove this inode from the queue */
+	spin_lock(&mdsc->snap_flush_lock);
+	list_del_init(&ci->i_snap_flush_item);
+	spin_unlock(&mdsc->snap_flush_lock);
+
 	list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
 		/*
 		 * we need to wait for sync writes to complete and for dirty
@@ -1726,10 +1731,6 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
 		*psession = session;
 	else
 		ceph_put_mds_session(session);
-	/* we flushed them all; remove this inode from the queue */
-	spin_lock(&mdsc->snap_flush_lock);
-	list_del_init(&ci->i_snap_flush_item);
-	spin_unlock(&mdsc->snap_flush_lock);
 }
 
 /*
On Thu, May 25, 2023 at 4:45 AM xiubli@redhat.com wrote:
From: Xiubo Li <xiubli@redhat.com>

There is a race between flushing capsnaps and removing the inode from the 'mdsc->snap_flush_list' list:

Thread A                                 Thread B

ceph_queue_cap_snap()
 -> allocate 'capsnapA'
 -> ihold('&ci->vfs_inode')
 -> add 'capsnapA' to 'ci->i_cap_snaps'
 -> add 'ci' to 'mdsc->snap_flush_list'
 ...

ceph_flush_snaps()
 -> __ceph_flush_snaps()
  -> __send_flush_snap()
                                         handle_cap_flushsnap_ack()
                                          -> iput('&ci->vfs_inode')
                                             this also will release 'ci'
                                         ...
                                         ceph_handle_snap()
                                          -> flush_snaps()
                                           -> iterate 'mdsc->snap_flush_list'
                                           -> get the stale 'ci'
                                           -> remove 'ci' from
                                              'mdsc->snap_flush_list'
                                           -> ihold(&ci->vfs_inode) this
                                              will WARNING

To fix this, remove 'ci' from the 'mdsc->snap_flush_list' list just before calling '__send_flush_snap()', to make sure the flushsnap 'ack' is always handled after 'ci' has been removed from 'snap_flush_list'.
Hi Xiubo,
I'm not sure I'm following the logic here. If the issue is that the inode can be released by handle_cap_flushsnap_ack(), meaning that ci is unsafe to dereference after the ack is received, what makes e.g. the following snippet in __ceph_flush_snaps() work:
        ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
                                oldest_flush_tid);
        if (ret < 0) {
                pr_err("__flush_snaps: error sending cap flushsnap, "
                       "ino (%llx.%llx) tid %llu follows %llu\n",
                       ceph_vinop(inode), cf->tid, capsnap->follows);
        }

        ceph_put_cap_snap(capsnap);
        spin_lock(&ci->i_ceph_lock);
If the ack is handled after capsnap is put but before ci->i_ceph_lock is reacquired, could use-after-free occur inside spin_lock()?
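To make the window explicit (the annotations below are added for illustration; they are not in the tree):

        ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
                                oldest_flush_tid);
        ...
        ceph_put_cap_snap(capsnap);
        /* if handle_cap_flushsnap_ack() runs here, its iput() may drop
         * the last reference to the inode, which also frees 'ci' ... */
        spin_lock(&ci->i_ceph_lock);    /* ... so this touches freed memory */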
Thanks,
Ilya
On 5/31/23 19:09, Ilya Dryomov wrote:
Hi Xiubo,
I'm not sure I'm following the logic here. If the issue is that the inode can be released by handle_cap_flushsnap_ack(), meaning that ci is unsafe to dereference after the ack is received, what makes e.g. the following snippet in __ceph_flush_snaps() work:
        ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
                                oldest_flush_tid);
        if (ret < 0) {
                pr_err("__flush_snaps: error sending cap flushsnap, "
                       "ino (%llx.%llx) tid %llu follows %llu\n",
                       ceph_vinop(inode), cf->tid, capsnap->follows);
        }

        ceph_put_cap_snap(capsnap);
        spin_lock(&ci->i_ceph_lock);
If the ack is handled after capsnap is put but before ci->i_ceph_lock is reacquired, could use-after-free occur inside spin_lock()?
Yeah, certainly this could happen.
After 'ci' has been freed it's possible that it is still cached in the 'ceph_inode_cachep' slab cache or has been reused by others, so dereferencing the stale 'ci' mostly won't crash. But just before 'ci' is freed, 'ci->vfs_inode.i_count' will already be 0, which is why the 'WARN_ON(atomic_inc_return(&inode->i_count) < 2)' call trace shows up much more readily in the use-after-free case.
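For reference, ihold() in fs/inode.c is just:

        void ihold(struct inode *inode)
        {
                WARN_ON(atomic_inc_return(&inode->i_count) < 2);
        }

so taking a reference on an inode whose count has already dropped to zero trips the WARN_ON rather than crashing outright.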
Thanks
- Xiubo
On Wed, May 31, 2023 at 1:33 PM Xiubo Li xiubli@redhat.com wrote:
Yeah, certainly this could happen.
After 'ci' has been freed it's possible that it is still cached in the 'ceph_inode_cachep' slab cache or has been reused by others, so dereferencing the stale 'ci' mostly won't crash. But just before 'ci' is freed
Dereferencing an invalid pointer is a major bug regardless of whether it leads to a crash. Crashing fast is actually a pretty good case as far as potential outcomes go.
This needs to be addressed: just removing ci from mdsc->snap_flush_list earlier obviously doesn't cut it.
Thanks,
Ilya
On 5/31/23 20:01, Ilya Dryomov wrote:
Dereferencing an invalid pointer is a major bug regardless of whether it leads to a crash. Crashing fast is actually a pretty good case as far as potential outcomes go.
This needs to be addressed: just removing ci from mdsc->snap_flush_list earlier obviously doesn't cut it.
This could make sure that 'ci' will be removed from the 'mdsc->snap_flush_list' list before the ack arrives.
There is another way to fix this, though, which is to hold an extra 'i_count' reference when adding 'ci' to the 'mdsc->snap_flush_list' list. That one seems safer?
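Roughly (an untested sketch of that alternative, not a posted patch):

        /* in ceph_queue_cap_snap(), make list membership own a reference */
        spin_lock(&mdsc->snap_flush_lock);
        if (list_empty(&ci->i_snap_flush_item)) {
                ihold(inode);           /* pinned while on the list */
                list_add_tail(&ci->i_snap_flush_item,
                              &mdsc->snap_flush_list);
        }
        spin_unlock(&mdsc->snap_flush_lock);

        /* and when taking the inode off the list, drop that reference
         * after releasing the spinlock, since iput() may free the inode */
        bool on_list;

        spin_lock(&mdsc->snap_flush_lock);
        on_list = !list_empty(&ci->i_snap_flush_item);
        list_del_init(&ci->i_snap_flush_item);
        spin_unlock(&mdsc->snap_flush_lock);
        if (on_list)
                iput(inode);

That way 'ci' stays pinned for as long as it can be found via 'mdsc->snap_flush_list', no matter when the ack lands.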
Thanks
- Xiubo