inline data handling has a race between writing and writing to a memory map.
When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which destroys the inline data, but if block allocation fails, restores the inline data. In that process, we could have:
CPU1 CPU2 destroy_inline_data write_begin (does not see inline data) restory_inline_data write_end (sees inline data)
This leads to bugs like the one below, as write_begin did not prepare for the case of inline data, which is expected by the write_end side of it.
------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:235! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 UID: 0 PID: 5838 Comm: syz-executor110 Not tainted 6.13.0-rc3-syzkaller-00209-g499551201b5f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 RIP: 0010:ext4_write_inline_data fs/ext4/inline.c:235 [inline] RIP: 0010:ext4_write_inline_data_end+0xdc7/0xdd0 fs/ext4/inline.c:774 Code: 47 1d 8c e8 4b 3a 91 ff 90 0f 0b e8 63 7a 47 ff 48 8b 7c 24 10 48 c7 c6 e0 47 1d 8c e8 32 3a 91 ff 90 0f 0b e8 4a 7a 47 ff 90 <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffc900031c7320 EFLAGS: 00010293 RAX: ffffffff8257f9a6 RBX: 000000000000005a RCX: ffff888012968000 RDX: 0000000000000000 RSI: 000000000000005a RDI: 000000000000005b RBP: ffffc900031c7448 R08: ffffffff8257ef87 R09: 1ffff11006806070 R10: dffffc0000000000 R11: ffffed1006806071 R12: 000000000000005a R13: dffffc0000000000 R14: ffff888076b65bd8 R15: 000000000000005b FS: 00007f5c6bacf6c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000a00 CR3: 0000000073fb6000 CR4: 0000000000350ef0 Call Trace: <TASK> generic_perform_write+0x6f8/0x990 mm/filemap.c:4070 ext4_buffered_write_iter+0xc5/0x350 fs/ext4/file.c:299 ext4_file_write_iter+0x892/0x1c50 iter_file_splice_write+0xbfc/0x1510 fs/splice.c:743 do_splice_from fs/splice.c:941 [inline] direct_splice_actor+0x11d/0x220 fs/splice.c:1164 splice_direct_to_actor+0x588/0xc80 fs/splice.c:1108 do_splice_direct_actor fs/splice.c:1207 [inline] do_splice_direct+0x289/0x3e0 fs/splice.c:1233 do_sendfile+0x564/0x8a0 fs/read_write.c:1363 __do_sys_sendfile64 fs/read_write.c:1424 [inline] __se_sys_sendfile64+0x17c/0x1e0 fs/read_write.c:1410 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5c6bb18d09 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f5c6bacf218 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 RAX: ffffffffffffffda RBX: 00007f5c6bba0708 RCX: 00007f5c6bb18d09 RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000004 RBP: 00007f5c6bba0700 R08: 0000000000000000 R09: 0000000000000000 R10: 000080001d00c0d0 R11: 0000000000000246 R12: 00007f5c6bb6d620 R13: 00007f5c6bb6d0c0 R14: 0031656c69662f2e R15: 8088e3ad122bc192 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]---
This happens because ext4_page_mkwrite is not protected by the inode_lock. The xattr semaphore is not sufficient to protect inline data handling in a sane way, so we need to rely on the inode_lock. Adding the inode_lock to ext4_page_mkwrite is not an option, otherwise lock-ordering problems with mmap_lock may arise.
The conversion inside ext4_page_mkwrite was introduced at commit 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This fixes a documented bug in the commit message, which suggests some alternatives fixes.
The alternative of keeping the data inline is left as an exercise for the reader.
Instead, block allocation still happens, just as it does right now. But removal of the inline data is only done when pages are written back.
Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map") Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d Signed-off-by: Thadeu Lima de Souza Cascardo cascardo@igalia.com Cc: stable@vger.kernel.org --- fs/ext4/inode.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 94c7d2d828a64e42ded09c82497ed7617071aa19..38653d5c8b32ede2b130285ab13ef1e96b1ba783 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -2620,8 +2620,7 @@ static int ext4_do_writepages(struct mpage_da_data *mpd) ret = PTR_ERR(handle); goto out_writepages; } - BUG_ON(ext4_test_inode_state(inode, - EXT4_STATE_MAY_INLINE_DATA)); + ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); ext4_destroy_inline_data(handle, inode); ext4_journal_stop(handle); } @@ -6222,10 +6221,6 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
filemap_invalidate_lock_shared(mapping);
- err = ext4_convert_inline_data(inode); - if (err) - goto out_ret; - /* * On data journalling we skip straight to the transaction handle: * there's no delalloc; page truncated will be checked later; the
--- base-commit: a5806cd506af5a7c19bcd596e4708b5c464bfd21 change-id: 20250519-ext4_inline_page_mkwrite-c42ca1f02295
Best regards,
On Mon, May 19, 2025 at 07:42:46AM -0300, Thadeu Lima de Souza Cascardo wrote:
inline data handling has a race between writing and writing to a memory map.
When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which destroys the inline data, but if block allocation fails, restores the inline data. In that process, we could have:
CPU1 CPU2 destroy_inline_data write_begin (does not see inline data) restory_inline_data write_end (sees inline data)
The conversion inside ext4_page_mkwrite was introduced at commit 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This fixes a documented bug in the commit message, which suggests some alternatives fixes.
Your fix just reverts commit 7b4cc9787fe3, and removes the BUG_ON. While this is great for shutting up the syzbot report, but it causes file writes to an inline data file via a mmap to never get written back to the storage device. So you are replacing BUG_ON that can get triggered on a race condition in case of a failed block allocation, with silent data corruption. This is not an improvement.
Thanks for trying to address this, but I'm not going to accept your proposed fix.
- Ted
On Tue, May 20, 2025 at 10:57:08AM -0400, Theodore Ts'o wrote:
On Mon, May 19, 2025 at 07:42:46AM -0300, Thadeu Lima de Souza Cascardo wrote:
inline data handling has a race between writing and writing to a memory map.
When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which destroys the inline data, but if block allocation fails, restores the inline data. In that process, we could have:
CPU1 CPU2 destroy_inline_data write_begin (does not see inline data) restory_inline_data write_end (sees inline data)
The conversion inside ext4_page_mkwrite was introduced at commit 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This fixes a documented bug in the commit message, which suggests some alternatives fixes.
Your fix just reverts commit 7b4cc9787fe3, and removes the BUG_ON. While this is great for shutting up the syzbot report, but it causes file writes to an inline data file via a mmap to never get written back to the storage device. So you are replacing BUG_ON that can get triggered on a race condition in case of a failed block allocation, with silent data corruption. This is not an improvement.
Thanks for trying to address this, but I'm not going to accept your proposed fix.
- Ted
Hi, Ted.
I am trying to understand better the circumstances where the data loss might occur with the fix, but might not occur without the fix. Or, even if they occur either way, such that I can work on a better/proper fix.
Right now, if ext4_convert_inline_data (called from ext4_page_mkwrite) fails with ENOSPC, the memory access will lead to a SIGBUS. The same will happen without the fix, if there are no blocks available.
Now, without ext4_convert_inline_data, blocks will be allocated by ext4_page_mkwrite and written by ext4_do_writepages. Are you concerned about a failure between the clearing of the inode data and the writing of the block in ext4_do_writepages?
Or are you concerned about a potential race condition when allocating blocks?
Which of these cannot happen today with the code as is? If I understand correctly, the inline conversion code also calls ext4_destroy_inline_data before allocating and writing to blocks.
Thanks a lot for the review and guidance. Cascardo.
On Wed 21-05-25 18:52:03, Thadeu Lima de Souza Cascardo wrote:
On Tue, May 20, 2025 at 10:57:08AM -0400, Theodore Ts'o wrote:
On Mon, May 19, 2025 at 07:42:46AM -0300, Thadeu Lima de Souza Cascardo wrote:
inline data handling has a race between writing and writing to a memory map.
When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which destroys the inline data, but if block allocation fails, restores the inline data. In that process, we could have:
CPU1 CPU2 destroy_inline_data write_begin (does not see inline data) restory_inline_data write_end (sees inline data)
The conversion inside ext4_page_mkwrite was introduced at commit 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This fixes a documented bug in the commit message, which suggests some alternatives fixes.
Your fix just reverts commit 7b4cc9787fe3, and removes the BUG_ON. While this is great for shutting up the syzbot report, but it causes file writes to an inline data file via a mmap to never get written back to the storage device. So you are replacing BUG_ON that can get triggered on a race condition in case of a failed block allocation, with silent data corruption. This is not an improvement.
Thanks for trying to address this, but I'm not going to accept your proposed fix.
- Ted
Hi, Ted.
I am trying to understand better the circumstances where the data loss might occur with the fix, but might not occur without the fix. Or, even if they occur either way, such that I can work on a better/proper fix.
Right now, if ext4_convert_inline_data (called from ext4_page_mkwrite) fails with ENOSPC, the memory access will lead to a SIGBUS. The same will happen without the fix, if there are no blocks available.
Now, without ext4_convert_inline_data, blocks will be allocated by ext4_page_mkwrite and written by ext4_do_writepages. Are you concerned about a failure between the clearing of the inode data and the writing of the block in ext4_do_writepages?
Or are you concerned about a potential race condition when allocating blocks?
Which of these cannot happen today with the code as is? If I understand correctly, the inline conversion code also calls ext4_destroy_inline_data before allocating and writing to blocks.
Thanks a lot for the review and guidance.
So I'm not sure what Ted was exactly worried about because writeback code should normally allocate underlying blocks for writeout of the mmaped page AFAICT. But the problem I can see is that clearing EXT4_STATE_MAY_INLINE_DATA requires i_rwsem held as otherwise we may be racing with e.g. write(2) and switching EXT4_STATE_MAY_INLINE_DATA in the middle of the write will cause bad things (inconsistency between how write_begin() and write_end() callbacks behave).
Honza
On Mon, May 26, 2025 at 03:43:31PM +0200, Jan Kara wrote:
On Wed 21-05-25 18:52:03, Thadeu Lima de Souza Cascardo wrote:
On Tue, May 20, 2025 at 10:57:08AM -0400, Theodore Ts'o wrote:
On Mon, May 19, 2025 at 07:42:46AM -0300, Thadeu Lima de Souza Cascardo wrote:
inline data handling has a race between writing and writing to a memory map.
When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which destroys the inline data, but if block allocation fails, restores the inline data. In that process, we could have:
CPU1 CPU2 destroy_inline_data write_begin (does not see inline data) restory_inline_data write_end (sees inline data)
The conversion inside ext4_page_mkwrite was introduced at commit 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This fixes a documented bug in the commit message, which suggests some alternatives fixes.
Your fix just reverts commit 7b4cc9787fe3, and removes the BUG_ON. While this is great for shutting up the syzbot report, but it causes file writes to an inline data file via a mmap to never get written back to the storage device. So you are replacing BUG_ON that can get triggered on a race condition in case of a failed block allocation, with silent data corruption. This is not an improvement.
Thanks for trying to address this, but I'm not going to accept your proposed fix.
- Ted
Hi, Ted.
I am trying to understand better the circumstances where the data loss might occur with the fix, but might not occur without the fix. Or, even if they occur either way, such that I can work on a better/proper fix.
Right now, if ext4_convert_inline_data (called from ext4_page_mkwrite) fails with ENOSPC, the memory access will lead to a SIGBUS. The same will happen without the fix, if there are no blocks available.
Now, without ext4_convert_inline_data, blocks will be allocated by ext4_page_mkwrite and written by ext4_do_writepages. Are you concerned about a failure between the clearing of the inode data and the writing of the block in ext4_do_writepages?
Or are you concerned about a potential race condition when allocating blocks?
Which of these cannot happen today with the code as is? If I understand correctly, the inline conversion code also calls ext4_destroy_inline_data before allocating and writing to blocks.
Thanks a lot for the review and guidance.
So I'm not sure what Ted was exactly worried about because writeback code should normally allocate underlying blocks for writeout of the mmaped page AFAICT. But the problem I can see is that clearing EXT4_STATE_MAY_INLINE_DATA requires i_rwsem held as otherwise we may be racing with e.g. write(2) and switching EXT4_STATE_MAY_INLINE_DATA in the middle of the write will cause bad things (inconsistency between how write_begin() and write_end() callbacks behave).
Honza
-- Jan Kara jack@suse.com SUSE Labs, CR
Thanks, Jan.
I later noticed as well that writepages is not holding the inode lock either, so there would be a potential for race condition there as well.
I have sent a v2 that I find would not have this problem. But we should probably cleanup the handling of inline data in writepages as a followup.
Cascardo.
linux-stable-mirror@lists.linaro.org