----- Original Message -----
The fix from commit 2d29f6b96d8f introduced an unexpected performance regression when allocating blocks. Fix by rewriting the overly complicated wrap-around logic in gfs2_rbm_find in a more reasonable way. Discovered and verified with iozone.
Fixes: 2d29f6b96d8f ("gfs2: Fix loop in gfs2_rbm_find") Cc: stable@vger.kernel.org # v3.13+ Signed-off-by: Andreas Gruenbacher agruenba@redhat.com
fs/gfs2/rgrp.c | 40 +++++++++++++--------------------------- 1 file changed, 13 insertions(+), 27 deletions(-)
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c index 831d7cb5a49c4..3f16411d885d9 100644 --- a/fs/gfs2/rgrp.c +++ b/fs/gfs2/rgrp.c @@ -1730,25 +1730,15 @@ static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 *minext, const struct gfs2_inode *ip, bool nowrap) { struct buffer_head *bh;
- int initial_bii;
- u32 initial_offset; int first_bii = rbm->bii; u32 first_offset = rbm->offset; u32 offset; u8 *buffer;
- int n = 0;
- int iters = rbm->rgd->rd_length;
- bool wrapped = false; int ret; struct gfs2_bitmap *bi; struct gfs2_extent maxext = { .rbm.rgd = rbm->rgd, };
- /* If we are not starting at the beginning of a bitmap, then we
* need to add one to the bitmap count to ensure that we search
* the starting bitmap twice.
*/
- if (rbm->offset != 0)
iters++;
- while(1) { bi = rbm_bi(rbm); if ((ip == NULL || !gfs2_rs_active(&ip->i_res)) &&
@@ -1761,47 +1751,43 @@ static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 *minext, WARN_ON(!buffer_uptodate(bh)); if (state != GFS2_BLKST_UNLINKED && bi->bi_clone) buffer = bi->bi_clone + bi->bi_offset;
offset = gfs2_bitfit(buffer, bi->bi_bytes, rbm->offset, state);initial_offset = rbm->offset;
if (offset == BFITNOENT)
goto bitmap_full;
if (offset == BFITNOENT) {
if (state == GFS2_BLKST_FREE && rbm->offset == 0)
set_bit(GBF_FULL, &bi->bi_flags);
goto next_bitmap;
rbm->offset = offset; if (ip == NULL) return 0;}
ret = gfs2_reservation_check_and_update(rbm, ip, minext ? *minext : 0, &maxext); if (ret == 0) return 0;initial_bii = rbm->bii;
if (ret > 0) {
n += (rbm->bii - initial_bii);
if (ret > 0) goto next_iter;
if (ret == -E2BIG) {}
} return ret;n += rbm->bii - initial_bii; rbm->bii = 0; rbm->offset = 0; goto res_covered_end_of_rgrp;
-bitmap_full: /* Mark bitmap as full and fall through */
if ((state == GFS2_BLKST_FREE) && initial_offset == 0)
set_bit(GBF_FULL, &bi->bi_flags);
next_bitmap: /* Find next bitmap in the rgrp */ rbm->offset = 0; rbm->bii++; if (rbm->bii == rbm->rgd->rd_length) rbm->bii = 0; res_covered_end_of_rgrp:
if ((rbm->bii == 0) && nowrap)
break;
n++;
if (rbm->bii == 0) {
if (nowrap)
break;
wrapped = true;
}
next_iter:
if (n >= iters)
}if (wrapped && rbm->bii > first_bii) break;
2.20.1
Hi,
I think you're on the right track here, and I definitely like most of your improvements, but I'm concerned about one thing.
The reason we currently do (rd_length + 1) iterations is, for example, cases where space is really tight and the starting point of the search comes after the last free block.
So simplifying the test down to a tiny file system where there is only one rgrp, which contains the only bitmap. Assuming block size 4K, minus the rgrp header, you've got 0xf80 (3968) bytes possible, or 1984 blocks possible.
Let's say you have two processes, A and B. Let's say A wants to create a dinode, and its "hint" says to start at the (only) bitmap block 1500. But process B already has the rgrp locked because he's freeing a dinode at block 1000, so he marks block 1000 free and releases the lock. Now process A starts searching the bitmap at offset 1500 until the end. The only free block is 1000 (because of process B's free), so A's bitmap search reaches the end of the bitmap and gets BFITNOENT.
With this new patch, it will just go back next_bitmap, but there are no more bitmaps. It sets wrap, but since we're at the end, it returns -ENOSPC, when, in fact, there is a free block, no?
That's why we coded it the way we did: With today's algorithm, it would reach the end of the rgrp, wrap to the beginning again, and find free block 1000.
I know this scenario may seem absurd. My concern is not whether this particular scenario is broken with the patch, but more that the algorithm may cause:
(a) more file system fragmentation (b) more file fragmentation (c) improperly reporting of -ENOSPC (d) or something else.
Before we push this upstream, we should probably run a file system aging test, like sas calibration without unmounts, and compare filefrag results with older kernels that don't have 2d29f6b96d8f. And we should do it with very little free space after, say, 13 iterations. IOW, so it hits free space limitations for many different rgrps.
Regards,
Bob Peterson Red Hat File Systems