Hi all,
Bug fixes for 6.13.
If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome.
--D
kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=xfs...
xfsprogs git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=...
---
Commits in this patchset:
* xfs: fix off-by-one error in fsmap's end_daddr usage
* xfs: metapath scrubber should use the already loaded inodes
* xfs: keep quota directory inode loaded
* xfs: return a 64-bit block count from xfs_btree_count_blocks
* xfs: don't drop errno values when we fail to ficlone the entire range
* xfs: separate healthy clearing mask during repair
* xfs: set XFS_SICK_INO_SYMLINK_ZAPPED explicitly when zapping a symlink
* xfs: mark metadir repair tempfiles with IRECOVERY
* xfs: fix null bno_hint handling in xfs_rtallocate_rtg
* xfs: fix error bailout in xfs_rtginode_create
* xfs: update btree keys correctly when _insrec splits an inode root block
* xfs: fix scrub tracepoints when inode-rooted btrees are involved
* xfs: unlock inodes when erroring out of xfs_trans_alloc_dir
* xfs: only run precommits once per transaction object
* xfs: remove recursion in __xfs_trans_commit
* xfs: don't lose solo superblock counter update transactions
* xfs: don't lose solo dquot update transactions
* xfs: separate dquot buffer reads from xfs_dqflush
* xfs: clean up log item accesses in xfs_qm_dqflush{,_done}
* xfs: attach dquot buffer to dquot log item buffer
* xfs: convert quotacheck to attach dquot buffers
---
 fs/xfs/libxfs/xfs_btree.c        |   33 +++++--
 fs/xfs/libxfs/xfs_btree.h        |    2
 fs/xfs/libxfs/xfs_ialloc_btree.c |    4 +
 fs/xfs/libxfs/xfs_rtgroup.c      |    2
 fs/xfs/scrub/agheader.c          |    6 +
 fs/xfs/scrub/agheader_repair.c   |    6 +
 fs/xfs/scrub/fscounters.c        |    2
 fs/xfs/scrub/health.c            |   57 +++++++-----
 fs/xfs/scrub/ialloc.c            |    4 -
 fs/xfs/scrub/metapath.c          |   68 +++++---------
 fs/xfs/scrub/refcount.c          |    2
 fs/xfs/scrub/scrub.h             |    6 +
 fs/xfs/scrub/symlink_repair.c    |    3 -
 fs/xfs/scrub/tempfile.c          |   10 ++
 fs/xfs/scrub/trace.h             |    2
 fs/xfs/xfs_bmap_util.c           |    2
 fs/xfs/xfs_dquot.c               |  185 ++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_dquot.h               |    5 +
 fs/xfs/xfs_dquot_item.c          |   51 ++++++++--
 fs/xfs/xfs_dquot_item.h          |    7 +
 fs/xfs/xfs_file.c                |    8 ++
 fs/xfs/xfs_fsmap.c               |   38 +++++---
 fs/xfs/xfs_inode.h               |    2
 fs/xfs/xfs_qm.c                  |   92 +++++++++++++------
 fs/xfs/xfs_qm.h                  |    1
 fs/xfs/xfs_quota.h               |    7 +
 fs/xfs/xfs_rtalloc.c             |    2
 fs/xfs/xfs_trans.c               |   58 ++++++------
 fs/xfs/xfs_trans_ail.c           |    2
 fs/xfs/xfs_trans_dquot.c         |   31 +++++-
 30 files changed, 475 insertions(+), 223 deletions(-)
From: Darrick J. Wong djwong@kernel.org
In commit ca6448aed4f10a, we created an "end_daddr" variable to fix fsmap reporting when the end of the requested range falls in the middle of an unknown (aka free on the rmapbt) region. Unfortunately, I didn't notice that the code sets end_daddr to the last sector of the device but then uses that quantity to compute the length of the synthesized mapping.
Zizhi Wo later observed that when end_daddr isn't set, we still don't report the last fsblock on a device because in that case (aka when info->last is true), the info->high mapping that we pass to xfs_getfsmap_group_helper has a startblock that points to the last fsblock. This is also wrong because the code uses startblock to compute the length of the synthesized mapping.
Fix the second problem by setting end_daddr unconditionally, and fix the first problem by setting start_daddr to one past the end of the range to query.
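As a userspace sketch of the arithmetic (names and numbers are invented; this only demonstrates the off-by-one, not the real record synthesis):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t xfs_daddr_t;

/* length of a synthesized record covering [start, next) */
static xfs_daddr_t synth_len(xfs_daddr_t start, xfs_daddr_t next)
{
	return next - start;
}

int main(void)
{
	xfs_daddr_t last = 999;		/* last sector of the device */
	xfs_daddr_t next = 1000;	/* one past the queried range */

	/* buggy: starting the record at the last sector instead of one
	 * past it skews the computed length by one sector */
	printf("from last sector: %llu\n",
	       (unsigned long long)synth_len(last, next));
	printf("from last + 1:    %llu\n",
	       (unsigned long long)synth_len(last + 1, next));
	return 0;
}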
Cc: stable@vger.kernel.org # v6.11
Fixes: ca6448aed4f10a ("xfs: Fix missing interval for missing_owner in xfs fsmap")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
Reported-by: Zizhi Wo wozizhi@huawei.com
Reviewed-by: Christoph Hellwig hch@lst.de
---
 fs/xfs/xfs_fsmap.c |   38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index 82f2e0dd224997..3290dd8524a69a 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -163,7 +163,8 @@ struct xfs_getfsmap_info {
 	xfs_daddr_t		next_daddr;	/* next daddr we expect */
 	/* daddr of low fsmap key when we're using the rtbitmap */
 	xfs_daddr_t		low_daddr;
-	xfs_daddr_t		end_daddr;	/* daddr of high fsmap key */
+	/* daddr of high fsmap key, or the last daddr on the device */
+	xfs_daddr_t		end_daddr;
 	u64			missing_owner;	/* owner of holes */
 	u32			dev;		/* device id */
 	/*
@@ -387,8 +388,8 @@ xfs_getfsmap_group_helper(
 	 * we calculated from userspace's high key to synthesize the record.
 	 * Note that if the btree query found a mapping, there won't be a gap.
 	 */
-	if (info->last && info->end_daddr != XFS_BUF_DADDR_NULL)
-		frec->start_daddr = info->end_daddr;
+	if (info->last)
+		frec->start_daddr = info->end_daddr + 1;
 	else
 		frec->start_daddr = xfs_gbno_to_daddr(xg, startblock);
@@ -736,11 +737,10 @@ xfs_getfsmap_rtdev_rtbitmap_helper(
 	 * we calculated from userspace's high key to synthesize the record.
 	 * Note that if the btree query found a mapping, there won't be a gap.
 	 */
-	if (info->last && info->end_daddr != XFS_BUF_DADDR_NULL) {
-		frec.start_daddr = info->end_daddr;
-	} else {
+	if (info->last)
+		frec.start_daddr = info->end_daddr + 1;
+	else
 		frec.start_daddr = xfs_rtb_to_daddr(mp, start_rtb);
-	}
 
 	frec.len_daddr = XFS_FSB_TO_BB(mp, rtbcount);
 	return xfs_getfsmap_helper(tp, info, &frec);
@@ -933,7 +933,10 @@ xfs_getfsmap(
 	struct xfs_trans		*tp = NULL;
 	struct xfs_fsmap		dkeys[2];	/* per-dev keys */
 	struct xfs_getfsmap_dev		handlers[XFS_GETFSMAP_DEVS];
-	struct xfs_getfsmap_info	info = { NULL };
+	struct xfs_getfsmap_info	info = {
+		.fsmap_recs		= fsmap_recs,
+		.head			= head,
+	};
 	bool				use_rmap;
 	int				i;
 	int				error = 0;
@@ -998,9 +1001,6 @@ xfs_getfsmap(
 
 	info.next_daddr = head->fmh_keys[0].fmr_physical +
 			  head->fmh_keys[0].fmr_length;
-	info.end_daddr = XFS_BUF_DADDR_NULL;
-	info.fsmap_recs = fsmap_recs;
-	info.head = head;
 
 	/* For each device we support... */
 	for (i = 0; i < XFS_GETFSMAP_DEVS; i++) {
@@ -1013,17 +1013,23 @@ xfs_getfsmap(
 			break;
 
 		/*
-		 * If this device number matches the high key, we have
-		 * to pass the high key to the handler to limit the
-		 * query results.  If the device number exceeds the
-		 * low key, zero out the low key so that we get
-		 * everything from the beginning.
+		 * If this device number matches the high key, we have to pass
+		 * the high key to the handler to limit the query results, and
+		 * set the end_daddr so that we can synthesize records at the
+		 * end of the query range or device.
 		 */
 		if (handlers[i].dev == head->fmh_keys[1].fmr_device) {
 			dkeys[1] = head->fmh_keys[1];
 			info.end_daddr = min(handlers[i].nr_sectors - 1,
 					     dkeys[1].fmr_physical);
+		} else {
+			info.end_daddr = handlers[i].nr_sectors - 1;
 		}
+
+		/*
+		 * If the device number exceeds the low key, zero out the low
+		 * key so that we get everything from the beginning.
+		 */
 		if (handlers[i].dev > head->fmh_keys[0].fmr_device)
 			memset(&dkeys[0], 0, sizeof(struct xfs_fsmap));
From: Darrick J. Wong djwong@kernel.org
With the nrext64 feature enabled, it's possible for a data fork to have 2^48 extent mappings. Even with a 64k fsblock size, that maps out to a bmbt containing more than 2^32 blocks. Therefore, this predicate must return a u64 count to avoid an integer wraparound that will cause scrub to do the wrong thing.
It's unlikely that any such filesystem currently exists, because the incore bmbt would consume more than 64GB of kernel memory on its own, and so far nobody except me has driven a filesystem that far, judging from the lack of complaints.
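To make the wraparound concrete, a standalone sketch (the records-per-block figure is a rough assumption, not the exact bmbt geometry):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t nextents = 1ULL << 48;		/* max mappings with nrext64 */
	uint64_t recs_per_block = 4096;		/* rough guess for a 64k bmbt block */
	uint64_t leaf_blocks = nextents / recs_per_block;	/* 2^36 */

	/* xfs_extlen_t is 32 bits wide, so the count truncates */
	uint32_t truncated = (uint32_t)leaf_blocks;

	printf("true block count:  %llu\n", (unsigned long long)leaf_blocks);
	printf("as a 32-bit count: %u\n", truncated);	/* wraps to 0 */
	return 0;
}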
Cc: stable@vger.kernel.org # v5.19
Fixes: df9ad5cc7a5240 ("xfs: Introduce macros to represent new maximum extent counts for data/attr forks")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
Reviewed-by: Christoph Hellwig hch@lst.de
---
 fs/xfs/libxfs/xfs_btree.c        |    4 ++--
 fs/xfs/libxfs/xfs_btree.h        |    2 +-
 fs/xfs/libxfs/xfs_ialloc_btree.c |    4 +++-
 fs/xfs/scrub/agheader.c          |    6 +++---
 fs/xfs/scrub/agheader_repair.c   |    6 +++---
 fs/xfs/scrub/fscounters.c        |    2 +-
 fs/xfs/scrub/ialloc.c            |    4 ++--
 fs/xfs/scrub/refcount.c          |    2 +-
 fs/xfs/xfs_bmap_util.c           |    2 +-
 9 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c index 2b5fc5fd16435d..c748866ef92368 100644 --- a/fs/xfs/libxfs/xfs_btree.c +++ b/fs/xfs/libxfs/xfs_btree.c @@ -5144,7 +5144,7 @@ xfs_btree_count_blocks_helper( int level, void *data) { - xfs_extlen_t *blocks = data; + xfs_filblks_t *blocks = data; (*blocks)++;
return 0; @@ -5154,7 +5154,7 @@ xfs_btree_count_blocks_helper( int xfs_btree_count_blocks( struct xfs_btree_cur *cur, - xfs_extlen_t *blocks) + xfs_filblks_t *blocks) { *blocks = 0; return xfs_btree_visit_blocks(cur, xfs_btree_count_blocks_helper, diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h index 3b739459ebb0f4..c5bff273cae255 100644 --- a/fs/xfs/libxfs/xfs_btree.h +++ b/fs/xfs/libxfs/xfs_btree.h @@ -484,7 +484,7 @@ typedef int (*xfs_btree_visit_blocks_fn)(struct xfs_btree_cur *cur, int level, int xfs_btree_visit_blocks(struct xfs_btree_cur *cur, xfs_btree_visit_blocks_fn fn, unsigned int flags, void *data);
-int xfs_btree_count_blocks(struct xfs_btree_cur *cur, xfs_extlen_t *blocks); +int xfs_btree_count_blocks(struct xfs_btree_cur *cur, xfs_filblks_t *blocks);
union xfs_btree_rec *xfs_btree_rec_addr(struct xfs_btree_cur *cur, int n, struct xfs_btree_block *block); diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index 9b34896dd1a32f..6f270d8f4270cb 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -744,6 +744,7 @@ xfs_finobt_count_blocks( { struct xfs_buf *agbp = NULL; struct xfs_btree_cur *cur; + xfs_filblks_t blocks; int error;
error = xfs_ialloc_read_agi(pag, tp, 0, &agbp); @@ -751,9 +752,10 @@ xfs_finobt_count_blocks( return error;
cur = xfs_finobt_init_cursor(pag, tp, agbp); - error = xfs_btree_count_blocks(cur, tree_blocks); + error = xfs_btree_count_blocks(cur, &blocks); xfs_btree_del_cursor(cur, error); xfs_trans_brelse(tp, agbp); + *tree_blocks = blocks;
return error; } diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c index 61f80a6410c738..1d41b85478da9d 100644 --- a/fs/xfs/scrub/agheader.c +++ b/fs/xfs/scrub/agheader.c @@ -458,7 +458,7 @@ xchk_agf_xref_btreeblks( { struct xfs_agf *agf = sc->sa.agf_bp->b_addr; struct xfs_mount *mp = sc->mp; - xfs_agblock_t blocks; + xfs_filblks_t blocks; xfs_agblock_t btreeblks; int error;
@@ -507,7 +507,7 @@ xchk_agf_xref_refcblks( struct xfs_scrub *sc) { struct xfs_agf *agf = sc->sa.agf_bp->b_addr; - xfs_agblock_t blocks; + xfs_filblks_t blocks; int error;
if (!sc->sa.refc_cur) @@ -840,7 +840,7 @@ xchk_agi_xref_fiblocks( struct xfs_scrub *sc) { struct xfs_agi *agi = sc->sa.agi_bp->b_addr; - xfs_agblock_t blocks; + xfs_filblks_t blocks; int error = 0;
if (!xfs_has_inobtcounts(sc->mp)) diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index 0fad0baaba2f69..b45d2b32051a63 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -256,7 +256,7 @@ xrep_agf_calc_from_btrees( struct xfs_agf *agf = agf_bp->b_addr; struct xfs_mount *mp = sc->mp; xfs_agblock_t btreeblks; - xfs_agblock_t blocks; + xfs_filblks_t blocks; int error;
/* Update the AGF counters from the bnobt. */ @@ -946,7 +946,7 @@ xrep_agi_calc_from_btrees( if (error) goto err; if (xfs_has_inobtcounts(mp)) { - xfs_agblock_t blocks; + xfs_filblks_t blocks;
error = xfs_btree_count_blocks(cur, &blocks); if (error) @@ -959,7 +959,7 @@ xrep_agi_calc_from_btrees( agi->agi_freecount = cpu_to_be32(freecount);
if (xfs_has_finobt(mp) && xfs_has_inobtcounts(mp)) { - xfs_agblock_t blocks; + xfs_filblks_t blocks;
cur = xfs_finobt_init_cursor(sc->sa.pag, sc->tp, agi_bp); error = xfs_btree_count_blocks(cur, &blocks); diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c index 4a50f8e0004092..ca23cf4db6c5ef 100644 --- a/fs/xfs/scrub/fscounters.c +++ b/fs/xfs/scrub/fscounters.c @@ -261,7 +261,7 @@ xchk_fscount_btreeblks( struct xchk_fscounters *fsc, xfs_agnumber_t agno) { - xfs_extlen_t blocks; + xfs_filblks_t blocks; int error;
error = xchk_ag_init_existing(sc, agno, &sc->sa); diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c index abad54c3621d44..4dc7c83dc08a40 100644 --- a/fs/xfs/scrub/ialloc.c +++ b/fs/xfs/scrub/ialloc.c @@ -650,8 +650,8 @@ xchk_iallocbt_xref_rmap_btreeblks( struct xfs_scrub *sc) { xfs_filblks_t blocks; - xfs_extlen_t inobt_blocks = 0; - xfs_extlen_t finobt_blocks = 0; + xfs_filblks_t inobt_blocks = 0; + xfs_filblks_t finobt_blocks = 0; int error;
if (!sc->sa.ino_cur || !sc->sa.rmap_cur || diff --git a/fs/xfs/scrub/refcount.c b/fs/xfs/scrub/refcount.c index 2b6be75e942415..1c5e45cc64190c 100644 --- a/fs/xfs/scrub/refcount.c +++ b/fs/xfs/scrub/refcount.c @@ -491,7 +491,7 @@ xchk_refcount_xref_rmap( struct xfs_scrub *sc, xfs_filblks_t cow_blocks) { - xfs_extlen_t refcbt_blocks = 0; + xfs_filblks_t refcbt_blocks = 0; xfs_filblks_t blocks; int error;
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index a59bbe767a7dc4..0836fea2d6d814 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -103,7 +103,7 @@ xfs_bmap_count_blocks( struct xfs_mount *mp = ip->i_mount; struct xfs_ifork *ifp = xfs_ifork_ptr(ip, whichfork); struct xfs_btree_cur *cur; - xfs_extlen_t btblocks = 0; + xfs_filblks_t btblocks = 0; int error;
*nextents = 0;
From: Darrick J. Wong djwong@kernel.org
Way back when we first implemented FICLONE for XFS, life was simple -- either the entire remapping completed, or something happened and we had to return an errno explaining what happened. Neither of those ioctls supports returning partial results, so it's all or nothing.
Then things got complicated when copy_file_range came along, because it actually can return the number of bytes copied, so commit 3f68c1f562f1e4 tried to make it so that we could return a partial result if the REMAP_FILE_CAN_SHORTEN flag is set. This is also how FIDEDUPERANGE can indicate that the kernel performed a partial deduplication.
Unfortunately, the logic is wrong if an error stops the remapping and CAN_SHORTEN is not set. Because those callers cannot return partial results, it is an error for ->remap_file_range to return a positive quantity that is less than the @len passed in. Implementations really should be returning a negative errno in this case, because that's what btrfs (which introduced FICLONE{,RANGE}) did.
Therefore, ->remap_file_range implementations cannot silently drop an errno that they might have when the number of bytes remapped is less than the number of bytes requested and CAN_SHORTEN is not set.
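A userspace sketch of the resulting return-value rule (the flag value is illustrative; only the decision logic mirrors the patch):

#include <stdio.h>

#define REMAP_FILE_CAN_SHORTEN	(1 << 1)	/* illustrative value */

static long remap_result(int ret, long remapped, long len, unsigned int flags)
{
	/* an errno on a short remap must win if shortening wasn't allowed */
	if (ret && remapped < len && !(flags & REMAP_FILE_CAN_SHORTEN))
		return ret;
	return remapped > 0 ? remapped : ret;
}

int main(void)
{
	/* FICLONE-style caller: the error surfaces instead of "4096" */
	printf("%ld\n", remap_result(-5, 4096, 8192, 0));
	/* copy_file_range-style caller may take the short byte count */
	printf("%ld\n", remap_result(-5, 4096, 8192, REMAP_FILE_CAN_SHORTEN));
	return 0;
}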
Found by running generic/562 on a 64k fsblock filesystem and wondering why it reported corrupt files.
Cc: stable@vger.kernel.org # v4.20
Fixes: 3fc9f5e409319e ("xfs: remove xfs_reflink_remap_range")
Really-Fixes: 3f68c1f562f1e4 ("xfs: support returning partial reflink results")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
Reviewed-by: Christoph Hellwig hch@lst.de
---
 fs/xfs/xfs_file.c |    8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6de6b865ef11c..73562ff1c956f0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1228,6 +1228,14 @@ xfs_file_remap_range(
 	xfs_iunlock2_remapping(src, dest);
 	if (ret)
 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
+	/*
+	 * If the caller did not set CAN_SHORTEN, then it is not prepared to
+	 * handle partial results -- either the whole remap succeeds, or we
+	 * must say why it did not.  In this case, any error should be
+	 * returned to the caller.
+	 */
+	if (ret && remapped < len && !(remap_flags & REMAP_FILE_CAN_SHORTEN))
+		return ret;
 	return remapped > 0 ? remapped : ret;
 }
From: Darrick J. Wong djwong@kernel.org
In commit d9041681dd2f53, we introduced some XFS_SICK_*ZAPPED flags so that the inode record repair code could clean up a damaged inode record enough to iget the inode but still be able to remember that the higher level repair code needs to be called. As part of that, we introduced an xchk_mark_healthy_if_clean helper that is supposed to cause the ZAPPED state to be removed if that higher level metadata actually checks out. This was done by setting additional bits in sick_mask in the hope that xchk_update_health would clear all those bits after a healthy scrub.
Unfortunately, that's not quite what sick_mask means -- bits in that mask are indeed cleared if the metadata is healthy, but they're set if the metadata is NOT healthy. fsck is only intended to set the ZAPPED bits explicitly.
If something else sets the CORRUPT/XCORRUPT state after the xchk_mark_healthy_if_clean call, we end up marking the metadata zapped. This can happen if the following sequence happens:
1. Scrub runs, discovers that the metadata is fine but could be optimized and calls xchk_mark_healthy_if_clean on a ZAPPED flag. That causes the ZAPPED flag to be set in sick_mask because the metadata is not CORRUPT or XCORRUPT.
2. Repair runs to optimize the metadata.
3. Some other metadata used for cross-referencing in (1) becomes corrupt.
4. Post-repair scrub runs, but this time it sets CORRUPT or XCORRUPT due to the events in (3).
5. Now xchk_update_health sets the ZAPPED flag on the metadata we just repaired. This is not the correct state.
Fix this by moving the "if healthy" mask to a separate field, and only ever using it to clear the sick state.
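A standalone sketch of the two-mask semantics (state and helpers are mocked; only the masking logic follows the patch):

#include <stdio.h>

#define SICK_ZAPPED	(1 << 0)
#define SICK_CORE	(1 << 1)

static unsigned int sick_state;	/* stand-in for the incore health state */

static void update_health(unsigned int sick_mask, unsigned int healthy_mask,
			  int bad)
{
	unsigned int mask = sick_mask;

	if (!bad)
		mask |= healthy_mask;	/* healthy bits only widen the clear */

	if (bad)
		sick_state |= mask;	/* healthy_mask can never land here */
	else
		sick_state &= ~mask;
}

int main(void)
{
	/* old behavior: ZAPPED folded into sick_mask, a bad rescan sets it */
	sick_state = 0;
	update_health(SICK_CORE | SICK_ZAPPED, 0, 1);
	printf("old: %#x\n", sick_state);	/* ZAPPED wrongly set */

	/* new behavior: ZAPPED rides in healthy_mask, cleared only if ok */
	sick_state = 0;
	update_health(SICK_CORE, SICK_ZAPPED, 1);
	printf("new: %#x\n", sick_state);	/* only SICK_CORE set */
	return 0;
}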
Cc: stable@vger.kernel.org # v6.8
Fixes: d9041681dd2f53 ("xfs: set inode sick state flags when we zap either ondisk fork")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
Reviewed-by: Christoph Hellwig hch@lst.de
---
 fs/xfs/scrub/health.c |   57 ++++++++++++++++++++++++++++---------------------
 fs/xfs/scrub/scrub.h  |    6 +++++
 2 files changed, 39 insertions(+), 24 deletions(-)
diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c index ce86bdad37fa42..ccc6ca5934ca6a 100644 --- a/fs/xfs/scrub/health.c +++ b/fs/xfs/scrub/health.c @@ -71,7 +71,8 @@ /* Map our scrub type to a sick mask and a set of health update functions. */
enum xchk_health_group { - XHG_FS = 1, + XHG_NONE = 1, + XHG_FS, XHG_AG, XHG_INO, XHG_RTGROUP, @@ -83,6 +84,7 @@ struct xchk_health_map { };
static const struct xchk_health_map type_to_health_flag[XFS_SCRUB_TYPE_NR] = { + [XFS_SCRUB_TYPE_PROBE] = { XHG_NONE, 0 }, [XFS_SCRUB_TYPE_SB] = { XHG_AG, XFS_SICK_AG_SB }, [XFS_SCRUB_TYPE_AGF] = { XHG_AG, XFS_SICK_AG_AGF }, [XFS_SCRUB_TYPE_AGFL] = { XHG_AG, XFS_SICK_AG_AGFL }, @@ -133,7 +135,7 @@ xchk_mark_healthy_if_clean( { if (!(sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT | XFS_SCRUB_OFLAG_XCORRUPT))) - sc->sick_mask |= mask; + sc->healthy_mask |= mask; }
/* @@ -189,6 +191,7 @@ xchk_update_health( { struct xfs_perag *pag; struct xfs_rtgroup *rtg; + unsigned int mask = sc->sick_mask; bool bad;
/* @@ -203,50 +206,56 @@ xchk_update_health( return; }
- if (!sc->sick_mask) - return; - bad = (sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT | XFS_SCRUB_OFLAG_XCORRUPT)); + if (!bad) + mask |= sc->healthy_mask; switch (type_to_health_flag[sc->sm->sm_type].group) { + case XHG_NONE: + break; case XHG_AG: + if (!mask) + return; pag = xfs_perag_get(sc->mp, sc->sm->sm_agno); if (bad) - xfs_group_mark_corrupt(pag_group(pag), sc->sick_mask); + xfs_group_mark_corrupt(pag_group(pag), mask); else - xfs_group_mark_healthy(pag_group(pag), sc->sick_mask); + xfs_group_mark_healthy(pag_group(pag), mask); xfs_perag_put(pag); break; case XHG_INO: if (!sc->ip) return; - if (bad) { - unsigned int mask = sc->sick_mask; - - /* - * If we're coming in for repairs then we don't want - * sickness flags to propagate to the incore health - * status if the inode gets inactivated before we can - * fix it. - */ - if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) - mask |= XFS_SICK_INO_FORGET; + /* + * If we're coming in for repairs then we don't want sickness + * flags to propagate to the incore health status if the inode + * gets inactivated before we can fix it. + */ + if (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) + mask |= XFS_SICK_INO_FORGET; + if (!mask) + return; + if (bad) xfs_inode_mark_corrupt(sc->ip, mask); - } else - xfs_inode_mark_healthy(sc->ip, sc->sick_mask); + else + xfs_inode_mark_healthy(sc->ip, mask); break; case XHG_FS: + if (!mask) + return; if (bad) - xfs_fs_mark_corrupt(sc->mp, sc->sick_mask); + xfs_fs_mark_corrupt(sc->mp, mask); else - xfs_fs_mark_healthy(sc->mp, sc->sick_mask); + xfs_fs_mark_healthy(sc->mp, mask); break; case XHG_RTGROUP: + if (!mask) + return; rtg = xfs_rtgroup_get(sc->mp, sc->sm->sm_agno); if (bad) - xfs_group_mark_corrupt(rtg_group(rtg), sc->sick_mask); + xfs_group_mark_corrupt(rtg_group(rtg), mask); else - xfs_group_mark_healthy(rtg_group(rtg), sc->sick_mask); + xfs_group_mark_healthy(rtg_group(rtg), mask); xfs_rtgroup_put(rtg); break; default: diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index a7fda3e2b01377..5dbbe93cb49bfa 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -184,6 +184,12 @@ struct xfs_scrub { */ unsigned int sick_mask;
+ /* + * Clear these XFS_SICK_* flags but only if the scan is ok. Useful for + * removing ZAPPED flags after a repair. + */ + unsigned int healthy_mask; + /* next time we want to cond_resched() */ struct xchk_relax relax;
From: Darrick J. Wong djwong@kernel.org
If we need to reset a symlink target to the "durr it's busted" string, then we clear the zapped flag as well. However, this should be using the provided helper so that we don't set the zapped state on an otherwise ok symlink.
Cc: stable@vger.kernel.org # v6.10
Fixes: 2651923d8d8db0 ("xfs: online repair of symbolic links")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
Reviewed-by: Christoph Hellwig hch@lst.de
---
 fs/xfs/scrub/symlink_repair.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/scrub/symlink_repair.c b/fs/xfs/scrub/symlink_repair.c
index d015a86ef460fb..953ce7be78dc2f 100644
--- a/fs/xfs/scrub/symlink_repair.c
+++ b/fs/xfs/scrub/symlink_repair.c
@@ -36,6 +36,7 @@
 #include "scrub/tempfile.h"
 #include "scrub/tempexch.h"
 #include "scrub/reap.h"
+#include "scrub/health.h"
 
 /*
  * Symbolic Link Repair
@@ -233,7 +234,7 @@ xrep_symlink_salvage(
 	 * target zapped flag.
 	 */
 	if (buflen == 0) {
-		sc->sick_mask |= XFS_SICK_INO_SYMLINK_ZAPPED;
+		xchk_mark_healthy_if_clean(sc, XFS_SICK_INO_SYMLINK_ZAPPED);
 		sprintf(target_buf, DUMMY_TARGET);
 	}
From: Darrick J. Wong djwong@kernel.org
xfs_bmap_rtalloc initializes the bno_hint variable to NULLRTBLOCK (aka NULLFSBLOCK). If the allocation request is for a file range that's adjacent to an existing mapping, it will then change bno_hint to the blkno hint in the bmalloca structure.
In other words, bno_hint is either a rt block number, or it's all 1s. Unfortunately, commit ec12f97f1b8a8f didn't take the NULLRTBLOCK state into account, which means that it tries to translate that into a realtime extent number. We then end up with an obnoxiously high rtx number and pointlessly feed that to the near allocator. This often fails and falls back to the by-size allocator. Seeing as we had no locality hint anyway, this is a waste of time.
Fix the code to detect the lack of a bno_hint correctly. Found by running xfs/009 with metadir enabled and a 28k rt extent size.
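A minimal demonstration of why the truthiness test is wrong (the sentinel matches the kernel's all-ones NULLFSBLOCK; everything else is mocked):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t xfs_rtblock_t;
#define NULLFSBLOCK	((xfs_rtblock_t)-1)

int main(void)
{
	xfs_rtblock_t bno_hint = NULLFSBLOCK;	/* "no locality hint" */

	if (bno_hint)	/* buggy: all-ones is truthy */
		printf("buggy: feeding bogus hint %#llx to the near allocator\n",
		       (unsigned long long)bno_hint);

	if (bno_hint != NULLFSBLOCK)	/* fixed test */
		printf("fixed: translating a real hint\n");
	else
		printf("fixed: no hint, fall back to the default policy\n");
	return 0;
}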
Cc: stable@vger.kernel.org # v6.12
Fixes: ec12f97f1b8a8f ("xfs: make the rtalloc start hint a xfs_rtblock_t")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
Reviewed-by: Christoph Hellwig hch@lst.de
---
 fs/xfs/xfs_rtalloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 0cb534d71119a5..fcfa6e0eb3ad2a 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -1827,7 +1827,7 @@ xfs_rtallocate_rtg(
 	 * For an allocation to an empty file at offset 0, pick an extent that
 	 * will space things out in the rt area.
 	 */
-	if (bno_hint)
+	if (bno_hint != NULLFSBLOCK)
 		start = xfs_rtb_to_rtx(args.mp, bno_hint);
 	else if (!xfs_has_rtgroups(args.mp) && initial_user_data)
 		start = xfs_rtpick_extent(args.rtg, tp, maxlen);
From: Darrick J. Wong djwong@kernel.org
In commit 2c813ad66a72, I partially fixed a bug wherein xfs_btree_insrec would erroneously try to update the parent's key for a block that had been split if we decided to insert the new record into the new block. The solution was to detect this situation and update the in-core key value that we pass up to the caller so that the caller will (eventually) add the new block to the parent level of the tree with the correct key.
However, I missed a subtlety about the way inode-rooted btrees work. If the full block was a maximally sized inode root block, we'll solve that fullness by moving the root block's records to a new block, resizing the root block, and updating the root to point to the new block. We don't pass a pointer to the new block to the caller because that work has already been done. The new record will /always/ land in the new block, so in this case we need to use xfs_btree_update_keys to update the keys.
This bug can theoretically manifest itself in the very rare case that we split a bmbt root block and the new record lands in the very first slot of the new block, though I've never managed to trigger it in practice. However, it is very easy to reproduce by running generic/522 with the realtime rmapbt patchset if rtinherit=1.
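A toy classifier for the cases described above (all names invented; the real test lives in xfs_btree_insrec):

#include <stdio.h>
#include <stdbool.h>

/* nptr_null: no new-block pointer was handed back to the caller
 * bp_moved:  the buffer we inserted into is not the block we started with */
static const char *key_update(bool nptr_null, bool bp_moved)
{
	if (!nptr_null && bp_moved)
		return "hand the new block's keys back to the caller";
	return "walk towards the root updating keys";
}

int main(void)
{
	/* ordinary split, record landed in the new block */
	printf("split:      %s\n", key_update(false, true));
	/* inode root "split": contents moved to a new child, no nptr --
	 * without the nptr check, bp_moved alone took the wrong branch */
	printf("broot move: %s\n", key_update(true, true));
	return 0;
}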
Cc: stable@vger.kernel.org # v4.8
Fixes: 2c813ad66a7218 ("xfs: support btrees with overlapping intervals for keys")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
---
 fs/xfs/libxfs/xfs_btree.c |   29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index c748866ef92368..68ee1c299c25fd 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -3557,14 +3557,31 @@ xfs_btree_insrec(
 	xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS);
 
 	/*
-	 * If we just inserted into a new tree block, we have to
-	 * recalculate nkey here because nkey is out of date.
+	 * Update btree keys to reflect the newly added record or keyptr.
+	 * There are three cases here to be aware of.  Normally, all we have to
+	 * do is walk towards the root, updating keys as necessary.
 	 *
-	 * Otherwise we're just updating an existing block (having shoved
-	 * some records into the new tree block), so use the regular key
-	 * update mechanism.
+	 * If the caller had us target a full block for the insertion, we dealt
+	 * with that by calling the _make_block_unfull function.  If the
+	 * "make unfull" function splits the block, it'll hand us back the key
+	 * and pointer of the new block.  We haven't yet added the new block to
+	 * the next level up, so if we decide to add the new record to the new
+	 * block (bp->b_bn != old_bn), we have to update the caller's pointer
+	 * so that the caller adds the new block with the correct key.
+	 *
+	 * However, there is a third possibility-- if the selected block is the
+	 * root block of an inode-rooted btree and cannot be expanded further,
+	 * the "make unfull" function moves the root block contents to a new
+	 * block and updates the root block to point to the new block.  In this
+	 * case, no block pointer is passed back because the block has already
+	 * been added to the btree.  In this case, we need to use the regular
+	 * key update function, just like the first case.  This is critical for
+	 * overlapping btrees, because the high key must be updated to reflect
+	 * the entire tree, not just the subtree accessible through the first
+	 * child of the root (which is now two levels down from the root).
 	 */
-	if (bp && xfs_buf_daddr(bp) != old_bn) {
+	if (!xfs_btree_ptr_is_null(cur, &nptr) &&
+	    bp && xfs_buf_daddr(bp) != old_bn) {
 		xfs_btree_get_keys(cur, block, lkey);
 	} else if (xfs_btree_needs_key_update(cur, optr)) {
 		error = xfs_btree_update_keys(cur, level);
Looks good:
Reviewed-by: Christoph Hellwig hch@lst.de
From: Darrick J. Wong djwong@kernel.org
Fix a minor mistake in the scrub tracepoints that can manifest when inode-rooted btrees are enabled. The existing code worked fine for bmap btrees, but we should tighten the code up to be less sloppy.
Cc: stable@vger.kernel.org # v5.7
Fixes: 92219c292af8dd ("xfs: convert btree cursor inode-private member names")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
---
 fs/xfs/scrub/trace.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 9b38f5ad1eaf07..d2ae7e93acb08e 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -605,7 +605,7 @@ TRACE_EVENT(xchk_ifork_btree_op_error,
 	TP_fast_assign(
 		xfs_fsblock_t fsbno = xchk_btree_cur_fsbno(cur, level);
 		__entry->dev = sc->mp->m_super->s_dev;
-		__entry->ino = sc->ip->i_ino;
+		__entry->ino = cur->bc_ino.ip->i_ino;
 		__entry->whichfork = cur->bc_ino.whichfork;
 		__entry->type = sc->sm->sm_type;
 		__assign_str(name);
Looks good:
Reviewed-by: Christoph Hellwig hch@lst.de
From: Darrick J. Wong djwong@kernel.org
Debugging a filesystem patch with generic/475 caused the system to hang after the following sequences appeared in dmesg:
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5
XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5
XFS (dm-0): log I/O error -5
XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311).  Shutting down filesystem.
XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c.  Caller xfs_trans_dqresv+0x236/0x440 [xfs]
XFS (dm-0): Corruption detected. Unmount and run xfs_repair
XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7
The system is stuck in unmount trying to lock a couple of inodes so that they can be purged. The dquot corruption notice above is a clue to what happened -- a link() call tried to set up a transaction to link a child into a directory. Quota reservation for the transaction failed after IO errors shut down the filesystem, but then we forgot to unlock the inodes on our way out. Fix that.
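A mocked sketch of the lock discipline the fix restores (the two-inode locking and the transaction are stand-ins):

#include <stdio.h>

struct inode { int locked; };

static void ilock(struct inode *i)   { i->locked = 1; }
static void iunlock(struct inode *i) { i->locked = 0; }

static int trans_alloc_dir(struct inode *dp, struct inode *ip, int quota_error)
{
	ilock(dp);
	if (dp != ip)
		ilock(ip);

	if (quota_error) {
		/* error bailout must undo the locking done above */
		iunlock(dp);
		if (dp != ip)
			iunlock(ip);
		return quota_error;
	}
	return 0;	/* success: caller unlocks after the transaction */
}

int main(void)
{
	struct inode dp = { 0 }, ip = { 0 };

	trans_alloc_dir(&dp, &ip, -5 /* e.g. quota reservation failed */);
	printf("dp locked=%d, ip locked=%d\n", dp.locked, ip.locked);
	return 0;
}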
Cc: stable@vger.kernel.org # v6.10
Fixes: bd5562111d5839 ("xfs: Hold inode locks in xfs_trans_alloc_dir")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
---
 fs/xfs/xfs_trans.c |    3 +++
 1 file changed, 3 insertions(+)
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 30fbed27cf05cc..05b18e30368e4b 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -1435,5 +1435,8 @@ xfs_trans_alloc_dir(
 
 out_cancel:
 	xfs_trans_cancel(tp);
+	xfs_iunlock(dp, XFS_ILOCK_EXCL);
+	if (dp != ip)
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
Looks good:
Reviewed-by: Christoph Hellwig hch@lst.de
From: Darrick J. Wong djwong@kernel.org
Committing a transaction tx0 with a defer ops chain of (A, B, C) creates a chain of transactions that looks like this:
tx0 -> txA -> txB -> txC
Prior to commit cb042117488dbf, __xfs_trans_commit would run precommits on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C]. Unfortunately, after the finish_noroll loop we forgot to run precommits on txC. That was fixed by adding the second precommit call.
Unfortunately, none of us remembered that xfs_defer_finish_noroll calls __xfs_trans_commit a second time to commit tx0 before finishing work A in txA and committing that. In other words, we run precommits twice on tx0:
xfs_trans_commit(tx0)
  __xfs_trans_commit(tx0, false)
    xfs_trans_run_precommits(tx0)
    xfs_defer_finish_noroll(tx0)
      xfs_trans_roll(tx0)
        txA = xfs_trans_dup(tx0)
        __xfs_trans_commit(tx0, true)
          xfs_trans_run_precommits(tx0)
This currently isn't an issue because the inode item precommit is idempotent; the iunlink item precommit deletes itself so it can't be called again; and the buffer/dquot item precommits only check the incore objects for corruption. However, it doesn't make sense to run precommits twice.
Fix this situation by only running precommits after finish_noroll.
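To see the ordering difference, a toy model of the call flow (everything here is mocked; only the ordering mirrors __xfs_trans_commit):

#include <stdbool.h>
#include <stdio.h>

static void commit(const char *tp, bool nested, bool old_order);

/* Commits tp through a nested __xfs_trans_commit, then hands the last
 * transaction of the defer chain back to the outer call. */
static const char *defer_finish(const char *tp, bool old_order)
{
	commit(tp, true, old_order);
	return "txC";
}

static void commit(const char *tp, bool nested, bool old_order)
{
	if (old_order)
		printf("precommit %s\n", tp);	/* runs again on re-entry! */
	if (!nested) {
		tp = defer_finish(tp, old_order);
		if (old_order)
			printf("precommit %s\n", tp);	/* final-tx pass */
	}
	if (!old_order)
		printf("precommit %s\n", tp);	/* once per tx object */
	printf("cil commit %s\n", tp);
}

int main(void)
{
	printf("-- old order --\n");
	commit("tx0", false, true);	/* precommits tx0 twice */
	printf("-- new order --\n");
	commit("tx0", false, false);	/* precommits each tx exactly once */
	return 0;
}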
Cc: stable@vger.kernel.org # v6.4
Fixes: cb042117488dbf ("xfs: defered work could create precommits")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
---
 fs/xfs/xfs_trans.c |   16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 05b18e30368e4b..4a517250efc911 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -860,13 +860,6 @@ __xfs_trans_commit(
 
 	trace_xfs_trans_commit(tp, _RET_IP_);
 
-	error = xfs_trans_run_precommits(tp);
-	if (error) {
-		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
-			xfs_defer_cancel(tp);
-		goto out_unreserve;
-	}
-
 	/*
 	 * Finish deferred items on final commit. Only permanent transactions
 	 * should ever have deferred ops.
@@ -877,13 +870,12 @@ __xfs_trans_commit(
 		error = xfs_defer_finish_noroll(&tp);
 		if (error)
 			goto out_unreserve;
-
-		/* Run precommits from final tx in defer chain. */
-		error = xfs_trans_run_precommits(tp);
-		if (error)
-			goto out_unreserve;
 	}
 
+	error = xfs_trans_run_precommits(tp);
+	if (error)
+		goto out_unreserve;
+
 	/*
 	 * If there is nothing to be logged by the transaction,
 	 * then unlock all of the items associated with the
Looks good:
Reviewed-by: Christoph Hellwig hch@lst.de
From: Darrick J. Wong djwong@kernel.org
Superblock counter updates are tracked via per-transaction counters in the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_sb_deltas just prior to committing the log items to the CIL.
However, updating the per-transaction counter deltas does not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb counter update will be silently discarded if there are no other dirty log items attached to the transaction.
This is currently not the case anywhere in the filesystem because sb counter updates always dirty at least one other metadata item, but let's not leave a logic bomb.
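A toy model of the logic bomb (flag values and the commit skeleton are illustrative, not the kernel's):

#include <stdio.h>

#define XFS_TRANS_DIRTY		(1 << 0)	/* illustrative values */
#define XFS_TRANS_SB_DIRTY	(1 << 1)

static void commit(unsigned int t_flags)
{
	/* new ordering: turn counter deltas into dirty log items up front */
	if (t_flags & XFS_TRANS_SB_DIRTY)
		t_flags |= XFS_TRANS_DIRTY;	/* logging the sb dirties tp */

	if (!(t_flags & XFS_TRANS_DIRTY)) {
		printf("nothing to log -- deltas would have been dropped\n");
		return;
	}
	printf("committing dirty items to the CIL\n");
}

int main(void)
{
	commit(XFS_TRANS_SB_DIRTY);	/* pure counter update now survives */
	commit(0);			/* truly clean transaction */
	return 0;
}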
Cc: stable@vger.kernel.org # v2.6.35
Fixes: 0924378a689ccb ("xfs: split out iclog writing from xfs_trans_commit()")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
---
 fs/xfs/xfs_trans.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 26bb2343082af4..427a8ba0ab99e2 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -860,6 +860,13 @@ __xfs_trans_commit(
 
 	trace_xfs_trans_commit(tp, _RET_IP_);
 
+	/*
+	 * Commit per-transaction changes that are not already tracked through
+	 * log items.  This can add dirty log items to the transaction.
+	 */
+	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
+		xfs_trans_apply_sb_deltas(tp);
+
 	error = xfs_trans_run_precommits(tp);
 	if (error)
 		goto out_unreserve;
@@ -890,8 +897,6 @@ __xfs_trans_commit(
 	/*
 	 * If we need to update the superblock, then do it now.
 	 */
-	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
-		xfs_trans_apply_sb_deltas(tp);
 	xfs_trans_apply_dquot_deltas(tp);
 
 	xlog_cil_commit(log, tp, &commit_seq, regrant);
On Mon, Nov 25, 2024 at 05:28:53PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong djwong@kernel.org
>
> Superblock counter updates are tracked via per-transaction counters in the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_sb_deltas just prior to committing the log items to the CIL.
>
> However, updating the per-transaction counter deltas does not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb counter update will be silently discarded if there are no other dirty log items attached to the transaction.
>
> This is currently not the case anywhere in the filesystem because sb counter updates always dirty at least one other metadata item, but let's not leave a logic bomb.
xfs_trans_mod_sb is the only place that sets XFS_TRANS_SB_DIRTY, and always forces XFS_TRANS_DIRTY. So this seems kinda intentional, even if this new version is much cleaner. So the change looks fine:
Reviewed-by: Christoph Hellwig hch@lst.de
But I don't think the Fixes tag is really warranted.
On Mon, Nov 25, 2024 at 09:14:07PM -0800, Christoph Hellwig wrote:
> On Mon, Nov 25, 2024 at 05:28:53PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong djwong@kernel.org
> >
> > Superblock counter updates are tracked via per-transaction counters in the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_sb_deltas just prior to committing the log items to the CIL.
> >
> > However, updating the per-transaction counter deltas does not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb counter update will be silently discarded if there are no other dirty log items attached to the transaction.
> >
> > This is currently not the case anywhere in the filesystem because sb counter updates always dirty at least one other metadata item, but let's not leave a logic bomb.
>
> xfs_trans_mod_sb is the only place that sets XFS_TRANS_SB_DIRTY, and always forces XFS_TRANS_DIRTY. So this seems kinda intentional, even if this new version is much cleaner. So the change looks fine:
<nod> I don't think this one was absolutely necessary, it just seemed like a cleanup to put anything that set TRANS_DIRTY before the TRANS_DIRTY check.
> Reviewed-by: Christoph Hellwig hch@lst.de
>
> But I don't think the Fixes tag is really warranted.
Dropped.
--D
From: Darrick J. Wong djwong@kernel.org
Quota counter updates are tracked via incore objects which hang off the xfs_trans object. These changes are then turned into dirty log items in xfs_trans_apply_dquot_deltas just prior to committing the log items to the CIL.
However, updating the incore deltas does not cause XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure quota counter update will be silently discarded if there are no other dirty log items attached to the transaction.
This is currently not the case anywhere in the filesystem because quota updates always dirty at least one other metadata item, but a subsequent bug fix will add dquot log item precommits, so we actually need a dirty dquot log item prior to xfs_trans_run_precommits. Also let's not leave a logic bomb.
Cc: stable@vger.kernel.org # v2.6.35
Fixes: 0924378a689ccb ("xfs: split out iclog writing from xfs_trans_commit()")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
---
 fs/xfs/xfs_quota.h       |    7 ++++---
 fs/xfs/xfs_trans.c       |   10 +++-------
 fs/xfs/xfs_trans_dquot.c |   31 ++++++++++++++++++++++++++-----
 3 files changed, 33 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h index fa1317cc396c96..b864ed59787780 100644 --- a/fs/xfs/xfs_quota.h +++ b/fs/xfs/xfs_quota.h @@ -101,7 +101,8 @@ extern void xfs_trans_free_dqinfo(struct xfs_trans *); extern void xfs_trans_mod_dquot_byino(struct xfs_trans *, struct xfs_inode *, uint, int64_t); extern void xfs_trans_apply_dquot_deltas(struct xfs_trans *); -extern void xfs_trans_unreserve_and_mod_dquots(struct xfs_trans *); +void xfs_trans_unreserve_and_mod_dquots(struct xfs_trans *tp, + bool already_locked); int xfs_trans_reserve_quota_nblks(struct xfs_trans *tp, struct xfs_inode *ip, int64_t dblocks, int64_t rblocks, bool force); extern int xfs_trans_reserve_quota_bydquots(struct xfs_trans *, @@ -172,8 +173,8 @@ static inline void xfs_trans_mod_dquot_byino(struct xfs_trans *tp, struct xfs_inode *ip, uint field, int64_t delta) { } -#define xfs_trans_apply_dquot_deltas(tp) -#define xfs_trans_unreserve_and_mod_dquots(tp) +#define xfs_trans_apply_dquot_deltas(tp, a) +#define xfs_trans_unreserve_and_mod_dquots(tp, a) static inline int xfs_trans_reserve_quota_nblks(struct xfs_trans *tp, struct xfs_inode *ip, int64_t dblocks, int64_t rblocks, bool force) diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 427a8ba0ab99e2..4cd25717c9d130 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -866,6 +866,7 @@ __xfs_trans_commit( */ if (tp->t_flags & XFS_TRANS_SB_DIRTY) xfs_trans_apply_sb_deltas(tp); + xfs_trans_apply_dquot_deltas(tp);
error = xfs_trans_run_precommits(tp); if (error) @@ -894,11 +895,6 @@ __xfs_trans_commit(
ASSERT(tp->t_ticket != NULL);
- /* - * If we need to update the superblock, then do it now. - */ - xfs_trans_apply_dquot_deltas(tp); - xlog_cil_commit(log, tp, &commit_seq, regrant);
xfs_trans_free(tp); @@ -924,7 +920,7 @@ __xfs_trans_commit( * the dqinfo portion to be. All that means is that we have some * (non-persistent) quota reservations that need to be unreserved. */ - xfs_trans_unreserve_and_mod_dquots(tp); + xfs_trans_unreserve_and_mod_dquots(tp, true); if (tp->t_ticket) { if (regrant && !xlog_is_shutdown(log)) xfs_log_ticket_regrant(log, tp->t_ticket); @@ -1018,7 +1014,7 @@ xfs_trans_cancel( } #endif xfs_trans_unreserve_and_mod_sb(tp); - xfs_trans_unreserve_and_mod_dquots(tp); + xfs_trans_unreserve_and_mod_dquots(tp, false);
if (tp->t_ticket) { xfs_log_ticket_ungrant(log, tp->t_ticket); diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c index 481ba3dc9f190d..713b6d243e5631 100644 --- a/fs/xfs/xfs_trans_dquot.c +++ b/fs/xfs/xfs_trans_dquot.c @@ -606,6 +606,24 @@ xfs_trans_apply_dquot_deltas( ASSERT(dqp->q_blk.reserved >= dqp->q_blk.count); ASSERT(dqp->q_ino.reserved >= dqp->q_ino.count); ASSERT(dqp->q_rtb.reserved >= dqp->q_rtb.count); + + /* + * We've applied the count changes and given back + * whatever reservation we didn't use. Zero out the + * dqtrx fields. + */ + qtrx->qt_blk_res = 0; + qtrx->qt_bcount_delta = 0; + qtrx->qt_delbcnt_delta = 0; + + qtrx->qt_rtblk_res = 0; + qtrx->qt_rtblk_res_used = 0; + qtrx->qt_rtbcount_delta = 0; + qtrx->qt_delrtb_delta = 0; + + qtrx->qt_ino_res = 0; + qtrx->qt_ino_res_used = 0; + qtrx->qt_icount_delta = 0; } } } @@ -642,7 +660,8 @@ xfs_trans_unreserve_and_mod_dquots_hook( */ void xfs_trans_unreserve_and_mod_dquots( - struct xfs_trans *tp) + struct xfs_trans *tp, + bool already_locked) { int i, j; struct xfs_dquot *dqp; @@ -671,10 +690,12 @@ xfs_trans_unreserve_and_mod_dquots( * about the number of blocks used field, or deltas. * Also we don't bother to zero the fields. */ - locked = false; + locked = already_locked; if (qtrx->qt_blk_res) { - xfs_dqlock(dqp); - locked = true; + if (!locked) { + xfs_dqlock(dqp); + locked = true; + } dqp->q_blk.reserved -= (xfs_qcnt_t)qtrx->qt_blk_res; } @@ -695,7 +716,7 @@ xfs_trans_unreserve_and_mod_dquots( dqp->q_rtb.reserved -= (xfs_qcnt_t)qtrx->qt_rtblk_res; } - if (locked) + if (locked && !already_locked) xfs_dqunlock(dqp);
}
On Mon, Nov 25, 2024 at 05:29:08PM -0800, Darrick J. Wong wrote:
> This is currently not the case anywhere in the filesystem because quota updates always dirty at least one other metadata item, but a subsequent bug fix will add dquot log item precommits, so we actually need a dirty dquot log item prior to xfs_trans_run_precommits. Also let's not leave a logic bomb.
Unlike for the sb updates there doesn't seem to be anything in xfs_trans_mod_dquot that forces the transaction to be dirty, so I guess the fixes tag is fine here.
Reviewed-by: Christoph Hellwig hch@lst.de
On Mon, Nov 25, 2024 at 09:18:14PM -0800, Christoph Hellwig wrote:
> On Mon, Nov 25, 2024 at 05:29:08PM -0800, Darrick J. Wong wrote:
> > This is currently not the case anywhere in the filesystem because quota updates always dirty at least one other metadata item, but a subsequent bug fix will add dquot log item precommits, so we actually need a dirty dquot log item prior to xfs_trans_run_precommits. Also let's not leave a logic bomb.
>
> Unlike for the sb updates there doesn't seem to be anything in xfs_trans_mod_dquot that forces the transaction to be dirty, so I guess the fixes tag is fine here.
Correct.
> Reviewed-by: Christoph Hellwig hch@lst.de
Thanks!
--D
From: Darrick J. Wong djwong@kernel.org
Ever since 6.12-rc1, I've observed a pile of warnings from the kernel when running fstests with quotas enabled:
WARNING: CPU: 1 PID: 458580 at mm/page_alloc.c:4221 __alloc_pages_noprof+0xc9c/0xf18
CPU: 1 UID: 0 PID: 458580 Comm: xfsaild/sda3 Tainted: G W 6.12.0-rc6-djwa #rc6 6ee3e0e531f6457e2d26aa008a3b65ff184b377c
<snip>
Call trace:
 __alloc_pages_noprof+0xc9c/0xf18
 alloc_pages_mpol_noprof+0x94/0x240
 alloc_pages_noprof+0x68/0xf8
 new_slab+0x3e0/0x568
 ___slab_alloc+0x5a0/0xb88
 __slab_alloc.constprop.0+0x7c/0xf8
 __kmalloc_noprof+0x404/0x4d0
 xfs_buf_get_map+0x594/0xde0 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfs_buf_read_map+0x64/0x2e0 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfs_trans_read_buf_map+0x1dc/0x518 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfs_qm_dqflush+0xac/0x468 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfs_qm_dquot_logitem_push+0xe4/0x148 [xfs 384cb02810558b4c490343c164e9407332118f88]
 xfsaild+0x3f4/0xde8 [xfs 384cb02810558b4c490343c164e9407332118f88]
 kthread+0x110/0x128
 ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
This corresponds to the line:
WARN_ON_ONCE(current->flags & PF_MEMALLOC);
within the NOFAIL checks. What's happening here is that the XFS AIL is trying to write a disk quota update back into the filesystem, but for that it needs to read the ondisk buffer for the dquot. The buffer is not in memory anymore, probably because it was evicted. Regardless, the buffer cache tries to allocate a new buffer, but those allocations are NOFAIL. The AIL thread has marked itself PF_MEMALLOC (aka noreclaim) since commit 43ff2122e6492b ("xfs: on-stack delayed write buffer lists") presumably because reclaim can push on XFS to push on the AIL.
An easy way to fix this probably would have been to drop the NOFAIL flag from the xfs_buf allocation and open code a retry loop, but then there's still the problem that for bs>ps filesystems, the buffer itself could require up to 64k worth of pages.
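That rejected alternative would have looked roughly like this (xfs_buf_try_read is an invented stand-in for a non-NOFAIL buffer read; note the loop still spins in reclaim context and does nothing for multi-page buffers):

#include <errno.h>
#include <stdio.h>

static int attempts;

/* mock of a buffer read that can fail with -ENOMEM instead of looping
 * internally under __GFP_NOFAIL */
static int xfs_buf_try_read(void)
{
	return ++attempts < 3 ? -ENOMEM : 0;
}

int main(void)
{
	int error;

	do {
		error = xfs_buf_try_read();
	} while (error == -ENOMEM);

	printf("read succeeded after %d attempts\n", attempts);
	return 0;
}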
Inode items had similar behavior (multi-page cluster buffers that we don't want to allocate in the AIL), which we solved by making transaction precommit attach the inode cluster buffers to the dirty log item. Let's solve the dquot problem in the same way.
So: Make a real precommit handler to read the dquot buffer and attach it to the log item; pass it to dqflush in the push method; and have the iodone function detach the buffer once we've flushed everything. Add a state flag to the log item to track when a thread has entered the precommit -> push mechanism to skip the detaching if it turns out that the dquot is very busy, as we don't hold the dquot lock between log item commit and AIL push.
Reading and attaching the dquot buffer in the precommit hook is inspired by the work done for inode cluster buffers some time ago.
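A toy model of the resulting attach/use/release protocol (hold counts and structures are mocked; the function roles follow the patch):

#include <stdbool.h>
#include <stdio.h>

struct buf { int hold; };
struct dq_item { struct buf *li_buf; bool qli_dirty; };

static struct buf the_buf;

static void attach_buf(struct dq_item *qli)	/* precommit hook */
{
	if (!qli->li_buf) {
		the_buf.hold++;		/* reference owned by the log item */
		qli->li_buf = &the_buf;
	}
	qli->qli_dirty = true;
}

static struct buf *use_attached_buf(struct dq_item *qli)	/* AIL push */
{
	if (qli->li_buf)
		qli->li_buf->hold++;	/* reference for this flush */
	return qli->li_buf;		/* no read or allocation needed here */
}

static void flush_done(struct dq_item *qli)	/* buffer iodone */
{
	if (!qli->qli_dirty && qli->li_buf) {
		qli->li_buf->hold--;	/* drop the log item's reference */
		qli->li_buf = NULL;
	}
}

int main(void)
{
	struct dq_item qli = { 0 };
	struct buf *bp;

	attach_buf(&qli);		/* transaction commit path */
	qli.qli_dirty = false;		/* dqflush clears it under qli_lock */
	bp = use_attached_buf(&qli);	/* AIL pushes the dquot */
	bp->hold--;			/* flush completes, drops its ref */
	flush_done(&qli);
	printf("holds after the cycle: %d\n", the_buf.hold);	/* 0 */
	return 0;
}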
Cc: stable@vger.kernel.org # v6.12
Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
---
 fs/xfs/xfs_dquot.c      |  120 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_dquot.h      |    5 ++
 fs/xfs/xfs_dquot_item.c |   39 ++++++++++-----
 fs/xfs/xfs_dquot_item.h |    7 +++
 fs/xfs/xfs_qm.c         |    6 +-
 fs/xfs/xfs_trans_ail.c  |    2 -
 6 files changed, 155 insertions(+), 24 deletions(-)
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index 4ba042786cfb7b..c495f7ad80018f 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -75,8 +75,24 @@ void xfs_qm_dqdestroy( struct xfs_dquot *dqp) { + struct xfs_dq_logitem *qlip = &dqp->q_logitem; + struct xfs_buf *bp = NULL; + ASSERT(list_empty(&dqp->q_lru));
+ /* + * Detach the dquot buffer if it's still attached, because we can get + * called through dqpurge after a log shutdown. + */ + spin_lock(&qlip->qli_lock); + if (qlip->qli_item.li_buf) { + bp = qlip->qli_item.li_buf; + qlip->qli_item.li_buf = NULL; + } + spin_unlock(&qlip->qli_lock); + if (bp) + xfs_buf_rele(bp); + kvfree(dqp->q_logitem.qli_item.li_lv_shadow); mutex_destroy(&dqp->q_qlock);
@@ -1146,6 +1162,7 @@ xfs_qm_dqflush_done( container_of(lip, struct xfs_dq_logitem, qli_item); struct xfs_dquot *dqp = qlip->qli_dquot; struct xfs_ail *ailp = lip->li_ailp; + struct xfs_buf *bp = NULL; xfs_lsn_t tail_lsn;
/* @@ -1175,6 +1192,19 @@ xfs_qm_dqflush_done( * Release the dq's flush lock since we're done with it. */ xfs_dqfunlock(dqp); + + /* + * If this dquot hasn't been dirtied since initiating the last dqflush, + * release the buffer reference. + */ + spin_lock(&qlip->qli_lock); + if (!qlip->qli_dirty) { + bp = lip->li_buf; + lip->li_buf = NULL; + } + spin_unlock(&qlip->qli_lock); + if (bp) + xfs_buf_rele(bp); }
void @@ -1197,7 +1227,7 @@ xfs_buf_dquot_io_fail(
spin_lock(&bp->b_mount->m_ail->ail_lock); list_for_each_entry(lip, &bp->b_li_list, li_bio_list) - xfs_set_li_failed(lip, bp); + set_bit(XFS_LI_FAILED, &lip->li_flags); spin_unlock(&bp->b_mount->m_ail->ail_lock); }
@@ -1249,6 +1279,7 @@ int xfs_dquot_read_buf( struct xfs_trans *tp, struct xfs_dquot *dqp, + xfs_buf_flags_t xbf_flags, struct xfs_buf **bpp) { struct xfs_mount *mp = dqp->q_mount; @@ -1256,7 +1287,7 @@ xfs_dquot_read_buf( int error;
error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, dqp->q_blkno, - mp->m_quotainfo->qi_dqchunklen, XBF_TRYLOCK, + mp->m_quotainfo->qi_dqchunklen, xbf_flags, &bp, &xfs_dquot_buf_ops); if (error == -EAGAIN) return error; @@ -1275,6 +1306,77 @@ xfs_dquot_read_buf( return error; }
+/* + * Attach a dquot buffer to this dquot to avoid allocating a buffer during a + * dqflush, since dqflush can be called from reclaim context. + */ +int +xfs_dquot_attach_buf( + struct xfs_trans *tp, + struct xfs_dquot *dqp) +{ + struct xfs_dq_logitem *qlip = &dqp->q_logitem; + struct xfs_log_item *lip = &qlip->qli_item; + int error; + + spin_lock(&qlip->qli_lock); + if (!lip->li_buf) { + struct xfs_buf *bp = NULL; + + spin_unlock(&qlip->qli_lock); + error = xfs_dquot_read_buf(tp, dqp, 0, &bp); + if (error) + return error; + + /* + * Attach the dquot to the buffer so that the AIL does not have + * to read the dquot buffer to push this item. + */ + xfs_buf_hold(bp); + spin_lock(&qlip->qli_lock); + lip->li_buf = bp; + xfs_trans_brelse(tp, bp); + } + qlip->qli_dirty = true; + spin_unlock(&qlip->qli_lock); + + return 0; +} + +/* + * Get a new reference the dquot buffer attached to this dquot for a dqflush + * operation. + * + * Returns 0 and a NULL bp if none was attached to the dquot; 0 and a locked + * bp; or -EAGAIN if the buffer could not be locked. + */ +int +xfs_dquot_use_attached_buf( + struct xfs_dquot *dqp, + struct xfs_buf **bpp) +{ + struct xfs_buf *bp = dqp->q_logitem.qli_item.li_buf; + + /* + * A NULL buffer can happen if the dquot dirty flag was set but the + * filesystem shut down before transaction commit happened. In that + * case we're not going to flush anyway. + */ + if (!bp) { + ASSERT(xfs_is_shutdown(dqp->q_mount)); + + *bpp = NULL; + return 0; + } + + if (!xfs_buf_trylock(bp)) + return -EAGAIN; + + xfs_buf_hold(bp); + *bpp = bp; + return 0; +} + /* * Write a modified dquot to disk. * The dquot must be locked and the flush lock too taken by caller. @@ -1289,7 +1391,8 @@ xfs_qm_dqflush( struct xfs_buf *bp) { struct xfs_mount *mp = dqp->q_mount; - struct xfs_log_item *lip = &dqp->q_logitem.qli_item; + struct xfs_dq_logitem *qlip = &dqp->q_logitem; + struct xfs_log_item *lip = &qlip->qli_item; struct xfs_dqblk *dqblk; xfs_failaddr_t fa; int error; @@ -1319,8 +1422,15 @@ xfs_qm_dqflush( */ dqp->q_flags &= ~XFS_DQFLAG_DIRTY;
- xfs_trans_ail_copy_lsn(mp->m_ail, &dqp->q_logitem.qli_flush_lsn, - &lip->li_lsn); + /* + * We hold the dquot lock, so nobody can dirty it while we're + * scheduling the write out. Clear the dirty-since-flush flag. + */ + spin_lock(&qlip->qli_lock); + qlip->qli_dirty = false; + spin_unlock(&qlip->qli_lock); + + xfs_trans_ail_copy_lsn(mp->m_ail, &qlip->qli_flush_lsn, &lip->li_lsn);
/* * copy the lsn into the on-disk dquot now while we have the in memory diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h index 50f8404c41176c..362ca34f7c248b 100644 --- a/fs/xfs/xfs_dquot.h +++ b/fs/xfs/xfs_dquot.h @@ -215,7 +215,7 @@ void xfs_dquot_to_disk(struct xfs_disk_dquot *ddqp, struct xfs_dquot *dqp);
void xfs_qm_dqdestroy(struct xfs_dquot *dqp); int xfs_dquot_read_buf(struct xfs_trans *tp, struct xfs_dquot *dqp, - struct xfs_buf **bpp); + xfs_buf_flags_t flags, struct xfs_buf **bpp); int xfs_qm_dqflush(struct xfs_dquot *dqp, struct xfs_buf *bp); void xfs_qm_dqunpin_wait(struct xfs_dquot *dqp); void xfs_qm_adjust_dqtimers(struct xfs_dquot *d); @@ -239,6 +239,9 @@ void xfs_dqlockn(struct xfs_dqtrx *q);
void xfs_dquot_set_prealloc_limits(struct xfs_dquot *);
+int xfs_dquot_attach_buf(struct xfs_trans *tp, struct xfs_dquot *dqp); +int xfs_dquot_use_attached_buf(struct xfs_dquot *dqp, struct xfs_buf **bpp); + static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp) { xfs_dqlock(dqp); diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c index 56ecc5ed01934d..271b195ebb9326 100644 --- a/fs/xfs/xfs_dquot_item.c +++ b/fs/xfs/xfs_dquot_item.c @@ -123,8 +123,9 @@ xfs_qm_dquot_logitem_push( __releases(&lip->li_ailp->ail_lock) __acquires(&lip->li_ailp->ail_lock) { - struct xfs_dquot *dqp = DQUOT_ITEM(lip)->qli_dquot; - struct xfs_buf *bp = lip->li_buf; + struct xfs_dq_logitem *qlip = DQUOT_ITEM(lip); + struct xfs_dquot *dqp = qlip->qli_dquot; + struct xfs_buf *bp; uint rval = XFS_ITEM_SUCCESS; int error;
@@ -155,11 +156,10 @@ xfs_qm_dquot_logitem_push(
spin_unlock(&lip->li_ailp->ail_lock);
- error = xfs_dquot_read_buf(NULL, dqp, &bp); - if (error) { - if (error == -EAGAIN) - rval = XFS_ITEM_LOCKED; + error = xfs_dquot_use_attached_buf(dqp, &bp); + if (error == -EAGAIN) { xfs_dqfunlock(dqp); + rval = XFS_ITEM_LOCKED; goto out_relock_ail; }
@@ -207,12 +207,10 @@ xfs_qm_dquot_logitem_committing( }
#ifdef DEBUG_EXPENSIVE -static int -xfs_qm_dquot_logitem_precommit( - struct xfs_trans *tp, - struct xfs_log_item *lip) +static void +xfs_qm_dquot_logitem_precommit_check( + struct xfs_dquot *dqp) { - struct xfs_dquot *dqp = DQUOT_ITEM(lip)->qli_dquot; struct xfs_mount *mp = dqp->q_mount; struct xfs_disk_dquot ddq = { }; xfs_failaddr_t fa; @@ -228,13 +226,24 @@ xfs_qm_dquot_logitem_precommit( xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE); ASSERT(fa == NULL); } - - return 0; } #else -# define xfs_qm_dquot_logitem_precommit NULL +# define xfs_qm_dquot_logitem_precommit_check(...) ((void)0) #endif
+static int +xfs_qm_dquot_logitem_precommit( + struct xfs_trans *tp, + struct xfs_log_item *lip) +{ + struct xfs_dq_logitem *qlip = DQUOT_ITEM(lip); + struct xfs_dquot *dqp = qlip->qli_dquot; + + xfs_qm_dquot_logitem_precommit_check(dqp); + + return xfs_dquot_attach_buf(tp, dqp); +} + static const struct xfs_item_ops xfs_dquot_item_ops = { .iop_size = xfs_qm_dquot_logitem_size, .iop_precommit = xfs_qm_dquot_logitem_precommit, @@ -259,5 +268,7 @@ xfs_qm_dquot_logitem_init(
xfs_log_item_init(dqp->q_mount, &lp->qli_item, XFS_LI_DQUOT, &xfs_dquot_item_ops); + spin_lock_init(&lp->qli_lock); lp->qli_dquot = dqp; + lp->qli_dirty = false; } diff --git a/fs/xfs/xfs_dquot_item.h b/fs/xfs/xfs_dquot_item.h index 794710c2447493..d66e52807d76d5 100644 --- a/fs/xfs/xfs_dquot_item.h +++ b/fs/xfs/xfs_dquot_item.h @@ -14,6 +14,13 @@ struct xfs_dq_logitem { struct xfs_log_item qli_item; /* common portion */ struct xfs_dquot *qli_dquot; /* dquot ptr */ xfs_lsn_t qli_flush_lsn; /* lsn at last flush */ + + /* + * We use this spinlock to coordinate access to the li_buf pointer in + * the log item and the qli_dirty flag. + */ + spinlock_t qli_lock; + bool qli_dirty; /* dirtied since last flush? */ };
void xfs_qm_dquot_logitem_init(struct xfs_dquot *dqp); diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index 341fe4821c2d77..a79c4a1bf27fab 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -148,7 +148,7 @@ xfs_qm_dqpurge( * We don't care about getting disk errors here. We need * to purge this dquot anyway, so we go ahead regardless. */ - error = xfs_dquot_read_buf(NULL, dqp, &bp); + error = xfs_dquot_read_buf(NULL, dqp, XBF_TRYLOCK, &bp); if (error == -EAGAIN) { xfs_dqfunlock(dqp); dqp->q_flags &= ~XFS_DQFLAG_FREEING; @@ -506,7 +506,7 @@ xfs_qm_dquot_isolate( /* we have to drop the LRU lock to flush the dquot */ spin_unlock(lru_lock);
- error = xfs_dquot_read_buf(NULL, dqp, &bp); + error = xfs_dquot_read_buf(NULL, dqp, XBF_TRYLOCK, &bp); if (error) { xfs_dqfunlock(dqp); goto out_unlock_dirty; @@ -1512,7 +1512,7 @@ xfs_qm_flush_one( goto out_unlock; }
- error = xfs_dquot_read_buf(NULL, dqp, &bp); + error = xfs_dquot_read_buf(NULL, dqp, XBF_TRYLOCK, &bp); if (error) goto out_unlock;
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 8ede9d099d1fea..f56d62dced97b1 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -360,7 +360,7 @@ xfsaild_resubmit_item(
/* protected by ail_lock */ list_for_each_entry(lip, &bp->b_li_list, li_bio_list) { - if (bp->b_flags & _XBF_INODES) + if (bp->b_flags & (_XBF_INODES | _XBF_DQUOTS)) clear_bit(XFS_LI_FAILED, &lip->li_flags); else xfs_clear_li_failed(lip);
On Mon, Nov 25, 2024 at 05:29:55PM -0800, Darrick J. Wong wrote:
> +	spin_lock(&qlip->qli_lock);
> +	if (qlip->qli_item.li_buf) {
> +		bp = qlip->qli_item.li_buf;
> +		qlip->qli_item.li_buf = NULL;
> +	}
> +	spin_unlock(&qlip->qli_lock);
> +	if (bp)
> +		xfs_buf_rele(bp);

[...]

> +	spin_lock(&qlip->qli_lock);
> +	if (!qlip->qli_dirty) {
> +		bp = lip->li_buf;
> +		lip->li_buf = NULL;
> +	}
> +	spin_unlock(&qlip->qli_lock);
> +	if (bp)
> +		xfs_buf_rele(bp);
Should this move into a common helper and always use either the buf or dirty check instead of mixing them?
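Something like the following, perhaps (a sketch only; the helper name and the combined buf+dirty check are assumptions on my part, not code from this series):

/*
 * Hypothetical common helper: drop the attached buffer reference iff
 * the dquot is clean and a buffer is still attached to the log item.
 */
static void
xfs_dquot_detach_buf_common(
	struct xfs_dq_logitem	*qlip)
{
	struct xfs_buf		*bp = NULL;

	spin_lock(&qlip->qli_lock);
	if (!qlip->qli_dirty && qlip->qli_item.li_buf) {
		bp = qlip->qli_item.li_buf;
		qlip->qli_item.li_buf = NULL;
	}
	spin_unlock(&qlip->qli_lock);
	if (bp)
		xfs_buf_rele(bp);
}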
Otherwise looks good:
Reviewed-by: Christoph Hellwig hch@lst.de
From: Darrick J. Wong djwong@kernel.org
Now that we've converted the dquot logging machinery to attach the dquot buffer to the li_buf pointer so that the AIL dqflush doesn't have to allocate or read buffers in a reclaim path, do the same for the quotacheck code so that the reclaim shrinker dqflush call doesn't have to do that either.
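Condensed from the hunks below, the shape of the change is roughly this (a sketch; error paths trimmed):

	/* quotacheck, with the dquot locked: pin the backing buffer to
	 * the log item so the flush path need not read it back in */
	error = xfs_dquot_attach_buf(NULL, dqp);
	if (error)
		return error;

	/* later, in xfs_qm_flush_one(): flush using the attached buffer
	 * instead of calling xfs_dquot_read_buf() in reclaim context */
	error = xfs_dquot_use_attached_buf(dqp, &bp);
	if (error)
		goto out_unlock;
	if (!bp) {
		error = -EFSCORRUPTED;
		goto out_unlock;
	}
	error = xfs_qm_dqflush(dqp, bp);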
Cc: stable@vger.kernel.org # v6.12
Fixes: 903edea6c53f09 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" djwong@kernel.org
---
 fs/xfs/xfs_dquot.c |    9 +++------
 fs/xfs/xfs_dquot.h |    2 --
 fs/xfs/xfs_qm.c    |   18 +++++++++++++-----
 3 files changed, 16 insertions(+), 13 deletions(-)
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index c495f7ad80018f..c47f95c96fe0cf 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1275,11 +1275,10 @@ xfs_qm_dqflush_check(
  * Requires dquot flush lock, will clear the dirty flag, delete the quota log
  * item from the AIL, and shut down the system if something goes wrong.
  */
-int
+static int
 xfs_dquot_read_buf(
 	struct xfs_trans	*tp,
 	struct xfs_dquot	*dqp,
-	xfs_buf_flags_t		xbf_flags,
 	struct xfs_buf		**bpp)
 {
 	struct xfs_mount	*mp = dqp->q_mount;
@@ -1287,10 +1286,8 @@ xfs_dquot_read_buf(
 	int			error;
 	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, dqp->q_blkno,
-			mp->m_quotainfo->qi_dqchunklen, xbf_flags,
+			mp->m_quotainfo->qi_dqchunklen, 0,
 			&bp, &xfs_dquot_buf_ops);
-	if (error == -EAGAIN)
-		return error;
 	if (xfs_metadata_is_sick(error))
 		xfs_dquot_mark_sick(dqp);
 	if (error)
@@ -1324,7 +1321,7 @@ xfs_dquot_attach_buf(
 	struct xfs_buf		*bp = NULL;
 		spin_unlock(&qlip->qli_lock);
-		error = xfs_dquot_read_buf(tp, dqp, 0, &bp);
+		error = xfs_dquot_read_buf(tp, dqp, &bp);
 		if (error)
 			return error;
diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index 362ca34f7c248b..1c5c911615bf7f 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -214,8 +214,6 @@ void xfs_dquot_to_disk(struct xfs_disk_dquot *ddqp, struct xfs_dquot *dqp);
 #define XFS_DQ_IS_DIRTY(dqp)	((dqp)->q_flags & XFS_DQFLAG_DIRTY)
 void xfs_qm_dqdestroy(struct xfs_dquot *dqp);
-int xfs_dquot_read_buf(struct xfs_trans *tp, struct xfs_dquot *dqp,
-		xfs_buf_flags_t flags, struct xfs_buf **bpp);
 int xfs_qm_dqflush(struct xfs_dquot *dqp, struct xfs_buf *bp);
 void xfs_qm_dqunpin_wait(struct xfs_dquot *dqp);
 void xfs_qm_adjust_dqtimers(struct xfs_dquot *d);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index a79c4a1bf27fab..e073ad51af1a3d 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -148,13 +148,13 @@ xfs_qm_dqpurge(
 	 * We don't care about getting disk errors here. We need
 	 * to purge this dquot anyway, so we go ahead regardless.
 	 */
-	error = xfs_dquot_read_buf(NULL, dqp, XBF_TRYLOCK, &bp);
+	error = xfs_dquot_use_attached_buf(dqp, &bp);
 	if (error == -EAGAIN) {
 		xfs_dqfunlock(dqp);
 		dqp->q_flags &= ~XFS_DQFLAG_FREEING;
 		goto out_unlock;
 	}
-	if (error)
+	if (!bp)
 		goto out_funlock;
 	/*
@@ -506,8 +506,8 @@ xfs_qm_dquot_isolate(
 	/* we have to drop the LRU lock to flush the dquot */
 	spin_unlock(lru_lock);
-	error = xfs_dquot_read_buf(NULL, dqp, XBF_TRYLOCK, &bp);
-	if (error) {
+	error = xfs_dquot_use_attached_buf(dqp, &bp);
+	if (!bp || error == -EAGAIN) {
 		xfs_dqfunlock(dqp);
 		goto out_unlock_dirty;
 	}
@@ -1330,6 +1330,10 @@ xfs_qm_quotacheck_dqadjust(
 		return error;
 	}
+	error = xfs_dquot_attach_buf(NULL, dqp);
+	if (error)
+		return error;
+
 	trace_xfs_dqadjust(dqp);
 	/*
@@ -1512,9 +1516,13 @@ xfs_qm_flush_one(
 		goto out_unlock;
 	}
-	error = xfs_dquot_read_buf(NULL, dqp, XBF_TRYLOCK, &bp);
+	error = xfs_dquot_use_attached_buf(dqp, &bp);
 	if (error)
 		goto out_unlock;
+	if (!bp) {
+		error = -EFSCORRUPTED;
+		goto out_unlock;
+	}
 	error = xfs_qm_dqflush(dqp, bp);
 	if (!error)
Looks good:
Reviewed-by: Christoph Hellwig hch@lst.de
On Mon, Nov 25, 2024 at 05:30:11PM -0800, Darrick J. Wong wrote:
[...]
Hi Darrick J. Wong,
Greetings!
I used Syzkaller and found a possible deadlock in xfs_dquot_detach_buf in Linux v6.13-rc3.
After bisection, the first bad commit is: ca378189fdfa ("xfs: convert quotacheck to attach dquot buffers")
All detailed info can be found at: https://github.com/laifryiee/syzkaller_logs/tree/main/241216_192201_xfs_dquo...
Syzkaller repro code: https://github.com/laifryiee/syzkaller_logs/tree/main/241216_192201_xfs_dquo...
Syzkaller repro syscall steps: https://github.com/laifryiee/syzkaller_logs/tree/main/241216_192201_xfs_dquo...
Syzkaller report: https://github.com/laifryiee/syzkaller_logs/tree/main/241216_192201_xfs_dquo...
Kconfig(make olddefconfig): https://github.com/laifryiee/syzkaller_logs/tree/main/241216_192201_xfs_dquo...
Bisect info: https://github.com/laifryiee/syzkaller_logs/tree/main/241216_192201_xfs_dquo...
bzImage: https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241216_19220...
Issue dmesg: https://github.com/laifryiee/syzkaller_logs/blob/main/241216_192201_xfs_dquo...
" [ 52.971391] ====================================================== [ 52.971706] WARNING: possible circular locking dependency detected [ 52.972026] 6.13.0-rc3-78d4f34e2115+ #1 Not tainted [ 52.972282] ------------------------------------------------------ [ 52.972596] repro/673 is trying to acquire lock: [ 52.972842] ffff88802366b510 (&lp->qli_lock){+.+.}-{3:3}, at: xfs_dquot_detach_buf+0x2d/0x190 [ 52.973324] [ 52.973324] but task is already holding lock: [ 52.973617] ffff888015681b30 (&l->lock){+.+.}-{3:3}, at: __list_lru_walk_one+0x409/0x810 [ 52.974039] [ 52.974039] which lock already depends on the new lock. [ 52.974039] [ 52.974442] [ 52.974442] the existing dependency chain (in reverse order) is: [ 52.974815] [ 52.974815] -> #3 (&l->lock){+.+.}-{3:3}: [ 52.975100] lock_acquire+0x80/0xb0 [ 52.975319] _raw_spin_lock+0x38/0x50 [ 52.975550] list_lru_add+0x198/0x5d0 [ 52.975770] list_lru_add_obj+0x207/0x360 [ 52.976008] xfs_buf_rele+0xcb6/0x1610 [ 52.976243] xfs_trans_brelse+0x385/0x410 [ 52.976484] xfs_imap_lookup+0x38d/0x5a0 [ 52.976719] xfs_imap+0x668/0xc80 [ 52.976923] xfs_iget+0x875/0x2dd0 [ 52.977129] xfs_mountfs+0x116b/0x2060 [ 52.977360] xfs_fs_fill_super+0x12bc/0x1f10 [ 52.977612] get_tree_bdev_flags+0x3d8/0x6c0 [ 52.977869] get_tree_bdev+0x29/0x40 [ 52.978086] xfs_fs_get_tree+0x26/0x30 [ 52.978310] vfs_get_tree+0x9e/0x390 [ 52.978526] path_mount+0x707/0x2000 [ 52.978742] __x64_sys_mount+0x2bf/0x350 [ 52.978974] x64_sys_call+0x1e1d/0x2140 [ 52.979210] do_syscall_64+0x6d/0x140 [ 52.979431] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 52.979723] [ 52.979723] -> #2 (&bch->bc_lock){+.+.}-{3:3}: [ 52.980029] lock_acquire+0x80/0xb0 [ 52.980240] _raw_spin_lock+0x38/0x50 [ 52.980456] _atomic_dec_and_lock+0xb8/0x100 [ 52.980712] xfs_buf_rele+0x112/0x1610 [ 52.980937] xfs_trans_brelse+0x385/0x410 [ 52.981175] xfs_imap_lookup+0x38d/0x5a0 [ 52.981410] xfs_imap+0x668/0xc80 [ 52.981612] xfs_iget+0x875/0x2dd0 [ 52.981816] xfs_mountfs+0x116b/0x2060 [ 52.982042] xfs_fs_fill_super+0x12bc/0x1f10 [ 52.982289] get_tree_bdev_flags+0x3d8/0x6c0 [ 52.982540] get_tree_bdev+0x29/0x40 [ 52.982756] xfs_fs_get_tree+0x26/0x30 [ 52.982979] vfs_get_tree+0x9e/0x390 [ 52.983194] path_mount+0x707/0x2000 [ 52.983406] __x64_sys_mount+0x2bf/0x350 [ 52.983637] x64_sys_call+0x1e1d/0x2140 [ 52.983865] do_syscall_64+0x6d/0x140 [ 52.984081] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 52.984366] [ 52.984366] -> #1 (&bp->b_lock){+.+.}-{3:3}: [ 52.984665] lock_acquire+0x80/0xb0 [ 52.984876] _raw_spin_lock+0x38/0x50 [ 52.985092] xfs_buf_rele+0x106/0x1610 [ 52.985319] xfs_trans_brelse+0x385/0x410 [ 52.985556] xfs_dquot_attach_buf+0x312/0x490 [ 52.985806] xfs_qm_quotacheck_dqadjust+0x122/0x580 [ 52.986083] xfs_qm_dqusage_adjust+0x5e0/0x7c0 [ 52.986340] xfs_iwalk_ag_recs+0x423/0x780 [ 52.986579] xfs_iwalk_run_callbacks+0x1e2/0x540 [ 52.986842] xfs_iwalk_ag+0x6e6/0x920 [ 52.987061] xfs_iwalk_ag_work+0x160/0x1e0 [ 52.987301] xfs_pwork_work+0x8b/0x180 [ 52.987528] process_one_work+0x92e/0x1b60 [ 52.987770] worker_thread+0x68d/0xe90 [ 52.987992] kthread+0x35a/0x470 [ 52.988186] ret_from_fork+0x56/0x90 [ 52.988401] ret_from_fork_asm+0x1a/0x30 [ 52.988627] [ 52.988627] -> #0 (&lp->qli_lock){+.+.}-{3:3}: [ 52.988924] __lock_acquire+0x2ff8/0x5d60 [ 52.989156] lock_acquire.part.0+0x142/0x390 [ 52.989402] lock_acquire+0x80/0xb0 [ 52.989609] _raw_spin_lock+0x38/0x50 [ 52.989820] xfs_dquot_detach_buf+0x2d/0x190 [ 52.990061] xfs_qm_dquot_isolate+0x1c6/0x12f0 [ 52.990312] __list_lru_walk_one+0x31a/0x810 [ 52.990553] 
list_lru_walk_one+0x4c/0x60 [ 52.990781] xfs_qm_shrink_scan+0x1d0/0x380 [ 52.991020] do_shrink_slab+0x410/0x10a0 [ 52.991253] shrink_slab+0x349/0x1370 [ 52.991469] drop_slab+0xf5/0x1f0 [ 52.991667] drop_caches_sysctl_handler+0x179/0x1a0 [ 52.991943] proc_sys_call_handler+0x418/0x610 [ 52.992197] proc_sys_write+0x2c/0x40 [ 52.992409] vfs_write+0xc59/0x1140 [ 52.992613] ksys_write+0x14f/0x280 [ 52.992817] __x64_sys_write+0x7b/0xc0 [ 52.993034] x64_sys_call+0x16b3/0x2140 [ 52.993259] do_syscall_64+0x6d/0x140 [ 52.993470] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 52.993748] [ 52.993748] other info that might help us debug this: [ 52.993748] [ 52.994133] Chain exists of: [ 52.994133] &lp->qli_lock --> &bch->bc_lock --> &l->lock [ 52.994133] [ 52.994613] Possible unsafe locking scenario: [ 52.994613] [ 52.994901] CPU0 CPU1 [ 52.995125] ---- ---- [ 52.995348] lock(&l->lock); [ 52.995505] lock(&bch->bc_lock); [ 52.995797] lock(&l->lock); [ 52.996067] lock(&lp->qli_lock); [ 52.996242] [ 52.996242] *** DEADLOCK *** [ 52.996242] [ 52.996530] 3 locks held by repro/673: [ 52.996719] #0: ffff888012734408 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x14f/0x280 [ 52.997127] #1: ffff888015681b30 (&l->lock){+.+.}-{3:3}, at: __list_lru_walk_one+0x409/0x810 [ 52.997559] #2: ffff88802366b5f8 (&dqp->q_qlock){+.+.}-{4:4}, at: xfs_qm_dquot_isolate+0x8f/0x12f0 [ 52.998011] [ 52.998011] stack backtrace: [ 52.998230] CPU: 1 UID: 0 PID: 673 Comm: repro Not tainted 6.13.0-rc3-78d4f34e2115+ #1 [ 52.998617] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qem4 [ 52.999165] Call Trace: [ 52.999296] <TASK> [ 52.999411] dump_stack_lvl+0xea/0x150 [ 52.999609] dump_stack+0x19/0x20 [ 52.999786] print_circular_bug+0x47f/0x750 [ 53.000002] check_noncircular+0x2f4/0x3e0 [ 53.000213] ? __pfx_check_noncircular+0x10/0x10 [ 53.000453] ? __pfx_lockdep_lock+0x10/0x10 [ 53.000668] __lock_acquire+0x2ff8/0x5d60 [ 53.000881] ? __pfx___lock_acquire+0x10/0x10 [ 53.001105] ? __kasan_check_read+0x15/0x20 [ 53.001324] ? mark_lock.part.0+0xf3/0x17b0 [ 53.001539] ? __this_cpu_preempt_check+0x21/0x30 [ 53.001779] ? lock_acquire.part.0+0x152/0x390 [ 53.002008] lock_acquire.part.0+0x142/0x390 [ 53.002228] ? xfs_dquot_detach_buf+0x2d/0x190 [ 53.002456] ? __pfx_lock_acquire.part.0+0x10/0x10 [ 53.002702] ? debug_smp_processor_id+0x20/0x30 [ 53.002933] ? rcu_is_watching+0x19/0xc0 [ 53.003141] ? trace_lock_acquire+0x13d/0x1b0 [ 53.003366] lock_acquire+0x80/0xb0 [ 53.003549] ? xfs_dquot_detach_buf+0x2d/0x190 [ 53.003776] _raw_spin_lock+0x38/0x50 [ 53.003965] ? xfs_dquot_detach_buf+0x2d/0x190 [ 53.004190] xfs_dquot_detach_buf+0x2d/0x190 [ 53.004408] xfs_qm_dquot_isolate+0x1c6/0x12f0 [ 53.004638] ? __pfx_xfs_qm_dquot_isolate+0x10/0x10 [ 53.004886] ? lock_acquire+0x80/0xb0 [ 53.005080] __list_lru_walk_one+0x31a/0x810 [ 53.005306] ? __pfx_xfs_qm_dquot_isolate+0x10/0x10 [ 53.005555] ? __pfx_xfs_qm_dquot_isolate+0x10/0x10 [ 53.005803] list_lru_walk_one+0x4c/0x60 [ 53.006006] xfs_qm_shrink_scan+0x1d0/0x380 [ 53.006221] ? __pfx_xfs_qm_shrink_scan+0x10/0x10 [ 53.006465] do_shrink_slab+0x410/0x10a0 [ 53.006673] shrink_slab+0x349/0x1370 [ 53.006864] ? __this_cpu_preempt_check+0x21/0x30 [ 53.007103] ? lock_release+0x441/0x870 [ 53.007303] ? __pfx_lock_release+0x10/0x10 [ 53.007517] ? shrink_slab+0x161/0x1370 [ 53.007717] ? __pfx_shrink_slab+0x10/0x10 [ 53.007931] ? 
mem_cgroup_iter+0x3a6/0x670 [ 53.008147] drop_slab+0xf5/0x1f0 [ 53.008324] drop_caches_sysctl_handler+0x179/0x1a0 [ 53.008575] proc_sys_call_handler+0x418/0x610 [ 53.008803] ? __pfx_proc_sys_call_handler+0x10/0x10 [ 53.009054] ? rcu_is_watching+0x19/0xc0 [ 53.009263] ? __this_cpu_preempt_check+0x21/0x30 [ 53.009504] proc_sys_write+0x2c/0x40 [ 53.009694] vfs_write+0xc59/0x1140 [ 53.009876] ? __pfx_proc_sys_write+0x10/0x10 [ 53.010101] ? __pfx_vfs_write+0x10/0x10 [ 53.010307] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 [ 53.010583] ksys_write+0x14f/0x280 [ 53.010765] ? __pfx_ksys_write+0x10/0x10 [ 53.010971] ? __audit_syscall_entry+0x39c/0x500 [ 53.011211] __x64_sys_write+0x7b/0xc0 [ 53.011404] ? syscall_trace_enter+0x14f/0x280 [ 53.011633] x64_sys_call+0x16b3/0x2140 [ 53.011831] do_syscall_64+0x6d/0x140 [ 53.012020] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 53.012275] RIP: 0033:0x7f56d423ee5d [ 53.012463] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 8 [ 53.013348] RSP: 002b:00007ffcc7ab57f8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 53.013723] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f56d423ee5d [ 53.014070] RDX: 0000000000000001 RSI: 0000000020000080 RDI: 0000000000000004 [ 53.014416] RBP: 00007ffcc7ab5810 R08: 00007ffcc7ab5810 R09: 00007ffcc7ab5810 [ 53.014763] R10: 002c646975756f6e R11: 0000000000000202 R12: 00007ffcc7ab5968 [ 53.015110] R13: 0000000000402d04 R14: 000000000041ae08 R15: 00007f56d45aa000 [ 53.015463] </TASK> "
Regards,
Yi Lai
---
If you do not need the following environment to reproduce the problem, or if you already have a reproduction environment, please ignore the information below.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65, a v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// You may need to remove the line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd" for a different qemu version
You can use the command below to log in; there is no password for root.
ssh -p 10023 root@localhost

After logging in to the VM successfully, you can transfer the repro binary to the VM as below and reproduce the problem there:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for the target kernel:
Please use the target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage  // x should be equal to or less than the number of CPUs your PC has

Point start3.sh at the bzImage file above to load the target kernel in the VM.

Tips:
If you already have qemu-system-x86_64, please ignore the info below.
If you want to install qemu v7.1.0:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install