On 12/4/24 2:46 PM, Heming Zhao wrote:
On 12/4/24 11:47, Joseph Qi wrote:
On 12/4/24 11:32 AM, Heming Zhao wrote:
This reverts commit dfe6c5692fb5 ("ocfs2: fix the la space leak when unmounting an ocfs2 volume").
In commit dfe6c5692fb5, the commit log statement "This bug has existed since the initial OCFS2 code." is incorrect. The commit that actually introduced the bug is 30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()").
Could you please elaborate on how it happens? It also seems to be no different from the new version. So we may need to submit a standalone revert patch for the stable kernels it was backported to (< 6.10).
The commit log of patch [2/2] should be revised.
Change:
    This bug has existed since the initial OCFS2 code.
to:
    This bug was introduced by commit 30dd3478c3cd ("ocfs2: correctly use ocfs2_find_next_zero_bit()")
See below for the details of patch [1/2].
The following is "the code before commit 30dd3478c3cd7" + "commit dfe6c5692fb525e":
static int ocfs2_sync_local_to_main()
{
	... ...
 1	while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start))
 2	       != -1) {
 3		if ((bit_off < left) && (bit_off == start)) {
 4			count++;
 5			start++;
 6			continue;
 7		}
 8		if (count) {
 9			blkno = la_start_blk +
10				ocfs2_clusters_to_blocks(osb->sb,
11							 start - count);
12
13			trace_ocfs2_sync_local_to_main_free();
14
15			status = ocfs2_release_clusters(handle,
16							main_bm_inode,
17							main_bm_bh, blkno,
18							count);
19			if (status < 0) {
20				mlog_errno(status);
21				goto bail;
22			}
23		}
24		if (bit_off >= left)
25			break;
26		count = 1;
27		start = bit_off + 1;
28	}
29
30	/* clear the contiguous bits until the end boundary */
31	if (count) {
32		blkno = la_start_blk +
33			ocfs2_clusters_to_blocks(osb->sb,
34						 start - count);
35
36		trace_ocfs2_sync_local_to_main_free();
37
38		status = ocfs2_release_clusters(handle,
39						main_bm_inode,
40						main_bm_bh, blkno,
41						count);
42		if (status < 0)
43			mlog_errno(status);
44	}
	... ...
}
bug flow:
- Initially 'left' is 10000, 'start' is 0, and bit_off is 9000; all bits from 9000 to the end of the bitmap are zero.
- Once 'start' reaches 10000, line 1 yields bit_off = 10000 (the 'left' value), so the condition at line 3 is not met.
- The code reaches line 8 with 'count' at 1000 (bits 9000 to 9999), and this block releases those 1000 clusters to main_bm.
- The code reaches line 24, where "bit_off >= left" holds (bit_off == left), and breaks out of the loop. At this point 'count' still holds its old value of 1000.
- The code reaches line 31, and this block releases the same space to main_bm for the same gd again.
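The flow above can be reproduced in user space with a minimal sketch. It is not kernel code: find_next_zero_bit_sim(), release_sim(), struct release_log, sync_local_to_main_sim() and run_example() are hypothetical names, the bitmap uses one byte per bit for clarity, and the helper is only assumed to behave like ocfs2_find_next_zero_bit() in returning 'size' when no zero bit is found.

```c
#include <assert.h>
#include <string.h>

#define LA_BITS 10000

/* Hypothetical stand-in for ocfs2_find_next_zero_bit(): index of the first
 * zero bit at or after 'start', or 'size' if there is none. */
static int find_next_zero_bit_sim(const unsigned char *bits, int size, int start)
{
	for (int i = start; i < size; i++)
		if (!bits[i])
			return i;
	return size;
}

/* Records each simulated ocfs2_release_clusters() call. */
struct release_log {
	int calls;
	int start[4];	/* first bit of each released run */
	int count[4];	/* length of each released run */
};

static void release_sim(struct release_log *log, int start, int count)
{
	assert(log->calls < 4);	/* the example never needs more slots */
	log->start[log->calls] = start;
	log->count[log->calls] = count;
	log->calls++;
}

/* Mirrors the control flow of lines 1-44 in the listing above. */
static struct release_log sync_local_to_main_sim(const unsigned char *bits,
						 int left)
{
	struct release_log log = {0};
	int start = 0, count = 0, bit_off;

	while ((bit_off = find_next_zero_bit_sim(bits, left, start)) != -1) {
		if ((bit_off < left) && (bit_off == start)) {
			count++;
			start++;
			continue;
		}
		if (count)			/* line 8 */
			release_sim(&log, start - count, count);
		if (bit_off >= left)		/* line 24 */
			break;
		count = 1;
		start = bit_off + 1;
	}
	if (count)				/* line 31: 'count' is stale */
		release_sim(&log, start - count, count);
	return log;
}

/* Bitmap from the bug flow: bits 0..8999 set, bits 9000..9999 zero. */
struct release_log run_example(void)
{
	static unsigned char bits[LA_BITS];

	memset(bits, 1, 9000);
	memset(bits + 9000, 0, LA_BITS - 9000);
	return sync_local_to_main_sim(bits, LA_BITS);
}
```

In this sketch run_example() records two release calls for the identical run, which corresponds to the double free that trips the group descriptor check.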
The kernel will likely report an error like the following:
OCFS2: ERROR (device dm-0): ocfs2_block_group_clear_bits: Group descriptor # 349184 has bit count 15872 but claims 19871 are freed. num_bits 7878
Okay, IIUC, it seems we have to:
1. revert commit dfe6c5692fb5 (in the stable kernels as well).
2. fix 30dd3478c3cd in the following way:
diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 5df34561c551..f0feadac2ef1 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -971,9 +971,9 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
 	start = count = 0;
 	left = le32_to_cpu(alloc->id1.bitmap1.i_total);
 
-	while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) <
+	while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) <=
 	       left) {
-		if (bit_off == start) {
+		if ((bit_off < left) && (bit_off == start)) {
 			count++;
 			start++;
 			continue;
@@ -997,7 +997,8 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
 				goto bail;
 			}
 		}
-
+		if (bit_off >= left)
+			break;
 		count = 1;
 		start = bit_off + 1;
 	}
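To sanity-check the loop shape in the diff above, here is a rough user-space sketch, not kernel code. It assumes dfe6c5692fb5 is reverted (so there is no release block after the loop) and that the bit search returns 'size' when no zero bit remains; find_next_zero_bit_sim(), sync_fixed_sim() and run_fixed_example() are hypothetical names, and the bitmap uses one byte per bit.

```c
#include <assert.h>
#include <string.h>

#define LA_BITS 10000

/* Hypothetical stand-in for ocfs2_find_next_zero_bit(): index of the first
 * zero bit at or after 'start', or 'size' if there is none. */
static int find_next_zero_bit_sim(const unsigned char *bits, int size, int start)
{
	for (int i = start; i < size; i++)
		if (!bits[i])
			return i;
	return size;
}

/* Loop shape after the proposed diff. Returns the number of simulated
 * release calls and stores the total number of released bits in *released. */
static int sync_fixed_sim(const unsigned char *bits, int left, int *released)
{
	int start = 0, count = 0, bit_off, calls = 0;

	assert(released != NULL);
	*released = 0;
	while ((bit_off = find_next_zero_bit_sim(bits, left, start)) <= left) {
		if ((bit_off < left) && (bit_off == start)) {
			count++;
			start++;
			continue;
		}
		if (count) {
			/* stands in for ocfs2_release_clusters() */
			calls++;
			*released += count;
		}
		if (bit_off >= left)
			break;	/* a run touching the end was released just above */
		count = 1;
		start = bit_off + 1;
	}
	return calls;
}

/* One interior zero run (100..149) and one run reaching the end (9000..9999). */
int run_fixed_example(int *released)
{
	static unsigned char bits[LA_BITS];

	memset(bits, 1, LA_BITS);
	memset(bits + 100, 0, 50);
	memset(bits + 9000, 0, LA_BITS - 9000);
	return sync_fixed_sim(bits, LA_BITS, released);
}
```

In this sketch each zero run is released exactly once, including the run that reaches the end boundary, so no extra release after the loop is needed.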
Thanks, Joseph