On Mon 26-05-25 11:13:34, Thadeu Lima de Souza Cascardo wrote:
inline data handling has a race between writing and writing to a memory map.
When ext4_page_mkwrite is called, it calls ext4_convert_inline_data, which destroys the inline data, but if block allocation fails, restores the inline data. In that process, we could have:
CPU1 CPU2 destroy_inline_data write_begin (does not see inline data) restory_inline_data write_end (sees inline data)
This leads to bugs like the one below, as write_begin did not prepare for the case of inline data, which is expected by the write_end side of it.
------------[ cut here ]------------ kernel BUG at fs/ext4/inline.c:235! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 UID: 0 PID: 5838 Comm: syz-executor110 Not tainted 6.13.0-rc3-syzkaller-00209-g499551201b5f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 RIP: 0010:ext4_write_inline_data fs/ext4/inline.c:235 [inline] RIP: 0010:ext4_write_inline_data_end+0xdc7/0xdd0 fs/ext4/inline.c:774 Code: 47 1d 8c e8 4b 3a 91 ff 90 0f 0b e8 63 7a 47 ff 48 8b 7c 24 10 48 c7 c6 e0 47 1d 8c e8 32 3a 91 ff 90 0f 0b e8 4a 7a 47 ff 90 <0f> 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffc900031c7320 EFLAGS: 00010293 RAX: ffffffff8257f9a6 RBX: 000000000000005a RCX: ffff888012968000 RDX: 0000000000000000 RSI: 000000000000005a RDI: 000000000000005b RBP: ffffc900031c7448 R08: ffffffff8257ef87 R09: 1ffff11006806070 R10: dffffc0000000000 R11: ffffed1006806071 R12: 000000000000005a R13: dffffc0000000000 R14: ffff888076b65bd8 R15: 000000000000005b FS: 00007f5c6bacf6c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000a00 CR3: 0000000073fb6000 CR4: 0000000000350ef0 Call Trace:
<TASK> generic_perform_write+0x6f8/0x990 mm/filemap.c:4070 ext4_buffered_write_iter+0xc5/0x350 fs/ext4/file.c:299 ext4_file_write_iter+0x892/0x1c50 iter_file_splice_write+0xbfc/0x1510 fs/splice.c:743 do_splice_from fs/splice.c:941 [inline] direct_splice_actor+0x11d/0x220 fs/splice.c:1164 splice_direct_to_actor+0x588/0xc80 fs/splice.c:1108 do_splice_direct_actor fs/splice.c:1207 [inline] do_splice_direct+0x289/0x3e0 fs/splice.c:1233 do_sendfile+0x564/0x8a0 fs/read_write.c:1363 __do_sys_sendfile64 fs/read_write.c:1424 [inline] __se_sys_sendfile64+0x17c/0x1e0 fs/read_write.c:1410 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5c6bb18d09 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f5c6bacf218 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 RAX: ffffffffffffffda RBX: 00007f5c6bba0708 RCX: 00007f5c6bb18d09 RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000004 RBP: 00007f5c6bba0700 R08: 0000000000000000 R09: 0000000000000000 R10: 000080001d00c0d0 R11: 0000000000000246 R12: 00007f5c6bb6d620 R13: 00007f5c6bb6d0c0 R14: 0031656c69662f2e R15: 8088e3ad122bc192 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]---
This happens because ext4_page_mkwrite is not protected by the inode_lock. The xattr semaphore is not sufficient to protect inline data handling in a sane way, so we need to rely on the inode_lock. Adding the inode_lock to ext4_page_mkwrite is not an option, otherwise lock-ordering problems with mmap_lock may arise.
The conversion inside ext4_page_mkwrite was introduced at commit 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map"). This fixes a documented bug in the commit message, which suggests some alternative fixes.
Convert inline data when mmap is called, instead of doing it only when the mmapped page is written to. Using the inode_lock there does not lead to lock-ordering issues.
The drawback is that inline conversion will happen when the file is mmapped, even though the page will not be written to.
Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map") Reported-by: syzbot+0c89d865531d053abb2d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0c89d865531d053abb2d Signed-off-by: Thadeu Lima de Souza Cascardo cascardo@igalia.com Cc: stable@vger.kernel.org
Changes in v2:
- Convert inline data at mmap time, avoiding data loss.
- Link to v1: https://lore.kernel.org/r/20250519-ext4_inline_page_mkwrite-v1-1-865d9a62b51...
fs/ext4/file.c | 6 ++++++ fs/ext4/inode.c | 4 ---- 2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c index beb078ee4811d6092e362e37307e7d87e5276cbc..f2380471df5d99500e49fdc639fa3e56143c328f 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -819,6 +819,12 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) if (!daxdev_mapping_supported(vma, dax_dev)) return -EOPNOTSUPP;
- inode_lock(inode);
- ret = ext4_convert_inline_data(inode);
- inode_unlock(inode);
- if (ret)
return ret;
- file_accessed(file); if (IS_DAX(file_inode(file))) { vma->vm_ops = &ext4_dax_vm_ops;
So I would *love* to do this and was thinking about this as well. But the trouble is that this causes lock inversion as well because ->mmap callback is called with mmap_lock held and so we cannot acquire inode_lock here either.
Recent changes which switch from ->mmap to ->mmap_prepare callback are actually going in a suitable direction but we'd need a rather larger rewrite to get from under mmap_lock and I'm not sure that's justified.
Anyway, thanks for having a look!
Honza