Hi Greg,
This backport series contains small fixes from v5.15 release.
From this point on, 5.10.y xfs can follow and pick changes posted to 5.15.y.
I already have some debt of fixes from v5.17 already applied to 5.15.y, but not yet submitted to 5.10.y - those will be included in my next batch.
Thanks, Amir.
Changes from [v1]: - Drop backport that disallows disabling of quota accounting on a mounted xfs (Darrick) - Added Acked-by Darrick - CC stable
[v1] https://lore.kernel.org/linux-xfs/20220809111708.92768-1-amir73il@gmail.com/
Darrick J. Wong (1): xfs: only set IOMAP_F_SHARED when providing a srcmap to a write
Dave Chinner (2): mm: Add kvrealloc() xfs: fix I_DONTCACHE
fs/xfs/xfs_icache.c | 3 ++- fs/xfs/xfs_iomap.c | 8 ++++---- fs/xfs/xfs_iops.c | 2 +- fs/xfs/xfs_log_recover.c | 4 +++- include/linux/mm.h | 2 ++ mm/util.c | 15 +++++++++++++++ 6 files changed, 27 insertions(+), 7 deletions(-)
From: Dave Chinner dchinner@redhat.com
commit de2860f4636256836450c6543be744a50118fc66 upstream.
During log recovery of an XFS filesystem with 64kB directory buffers, rebuilding a buffer split across two log records results in a memory allocation warning from krealloc like this:
xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff) XFS (dm-0): Unmounting Filesystem XFS (dm-0): Mounting V5 Filesystem XFS (dm-0): Starting recovery (logdev: internal) ------------[ cut here ]------------ WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40 ..... RIP: 0010:get_page_from_freelist+0xdee/0xe40 Call Trace: ? complete+0x3f/0x50 __alloc_pages+0x16f/0x300 alloc_pages+0x87/0x110 kmalloc_order+0x2c/0x90 kmalloc_order_trace+0x1d/0x90 __kmalloc_track_caller+0x215/0x270 ? xlog_recover_add_to_cont_trans+0x63/0x1f0 krealloc+0x54/0xb0 xlog_recover_add_to_cont_trans+0x63/0x1f0 xlog_recovery_process_trans+0xc1/0xd0 xlog_recover_process_ophdr+0x86/0x130 xlog_recover_process_data+0x9f/0x160 xlog_recover_process+0xa2/0x120 xlog_do_recovery_pass+0x40b/0x7d0 ? __irq_work_queue_local+0x4f/0x60 ? irq_work_queue+0x3a/0x50 xlog_do_log_recovery+0x70/0x150 xlog_do_recover+0x38/0x1d0 xlog_recover+0xd8/0x170 xfs_log_mount+0x181/0x300 xfs_mountfs+0x4a1/0x9b0 xfs_fs_fill_super+0x3c0/0x7b0 get_tree_bdev+0x171/0x270 ? suffix_kstrtoint.constprop.0+0xf0/0xf0 xfs_fs_get_tree+0x15/0x20 vfs_get_tree+0x24/0xc0 path_mount+0x2f5/0xaf0 __x64_sys_mount+0x108/0x140 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae
Essentially, we are taking a multi-order allocation from kmem_alloc() (which has an open coded no fail, no warn loop) and then reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is then triggering the above warning.
This is a regression caused by converting this code from an open coded no fail/no warn reallocation loop to using __GFP_NOFAIL.
What we actually need here is kvrealloc(), so that if contiguous page allocation fails we fall back to vmalloc() and we don't get nasty warnings happening in XFS.
Fixes: 771915c4f688 ("xfs: remove kmem_realloc()") Signed-off-by: Dave Chinner dchinner@redhat.com Acked-by: Mel Gorman mgorman@techsingularity.net Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org --- fs/xfs/xfs_log_recover.c | 4 +++- include/linux/mm.h | 2 ++ mm/util.c | 15 +++++++++++++++ 3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 69408782019e..e61f28ce3e44 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -2061,7 +2061,9 @@ xlog_recover_add_to_cont_trans( old_ptr = item->ri_buf[item->ri_cnt-1].i_addr; old_len = item->ri_buf[item->ri_cnt-1].i_len;
- ptr = krealloc(old_ptr, len + old_len, GFP_KERNEL | __GFP_NOFAIL); + ptr = kvrealloc(old_ptr, old_len, len + old_len, GFP_KERNEL); + if (!ptr) + return -ENOMEM; memcpy(&ptr[old_len], dp, len); item->ri_buf[item->ri_cnt-1].i_len += len; item->ri_buf[item->ri_cnt-1].i_addr = ptr; diff --git a/include/linux/mm.h b/include/linux/mm.h index 5b4d88faf114..b8b677f47a8d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -788,6 +788,8 @@ static inline void *kvcalloc(size_t n, size_t size, gfp_t flags) return kvmalloc_array(n, size, flags | __GFP_ZERO); }
+extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, + gfp_t flags); extern void kvfree(const void *addr); extern void kvfree_sensitive(const void *addr, size_t len);
diff --git a/mm/util.c b/mm/util.c index ba9643de689e..25bfda774f6f 100644 --- a/mm/util.c +++ b/mm/util.c @@ -661,6 +661,21 @@ void kvfree_sensitive(const void *addr, size_t len) } EXPORT_SYMBOL(kvfree_sensitive);
+void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags) +{ + void *newp; + + if (oldsize >= newsize) + return (void *)p; + newp = kvmalloc(newsize, flags); + if (!newp) + return NULL; + memcpy(newp, p, oldsize); + kvfree(p); + return newp; +} +EXPORT_SYMBOL(kvrealloc); + static inline void *__page_rmapping(struct page *page) { unsigned long mapping;
From: "Darrick J. Wong" djwong@kernel.org
commit 72a048c1056a72e37ea2ee34cc73d8c6d6cb4290 upstream.
While prototyping a free space defragmentation tool, I observed an unexpected IO error while running a sequence of commands that can be recreated by the following sequence of commands:
$ xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1 $ cp --reflink=always file1 file2 $ punch-alternating -o 1 file2 $ xfs_io -c "funshare 0 10m" file2 fallocate: Input/output error
I then scraped this (abbreviated) stack trace from dmesg:
WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450 CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:iomap_write_begin+0x376/0x450 RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297 RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000 RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50 R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40 R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000 FS: 00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0 Call Trace: iomap_unshare_actor+0x95/0x140 iomap_apply+0xfa/0x300 iomap_file_unshare+0x44/0x60 xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195] xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195] vfs_fallocate+0x133/0x330 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f4b8f79140a
Looking at the iomap tracepoints, I saw this:
iomap_iter: dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED iomap_iter_srcmap: dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags iomap_iter: dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED console: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
The first time funshare calls ->iomap_begin, xfs sees that the first block is shared and creates a 128k delalloc reservation in the COW fork. The delalloc reservation is returned as dstmap, and the shared block is returned as srcmap. So far so good.
funshare calls ->iomap_begin to try the second block. This time there's no srcmap (punch-alternating punched it out!) but we still have the delalloc reservation in the COW fork. Therefore, we again return the reservation as dstmap and the hole as srcmap. iomap_unshare_iter incorrectly tries to unshare the hole, which __iomap_write_begin rejects because shared regions must be fully written and therefore cannot require zeroing.
Therefore, change the buffered write iomap_begin function not to set IOMAP_F_SHARED when there isn't a source mapping to read from for the unsharing.
Signed-off-by: Darrick J. Wong djwong@kernel.org Reviewed-by: Christoph Hellwig hch@lst.de Reviewed-by: Chandan Babu R chandanrlinux@gmail.com Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org --- fs/xfs/xfs_iomap.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 74bc2beadc23..bd5a25f4952d 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -1062,11 +1062,11 @@ xfs_buffered_write_iomap_begin( error = xfs_bmbt_to_iomap(ip, srcmap, &imap, 0); if (error) return error; - } else { - xfs_trim_extent(&cmap, offset_fsb, - imap.br_startoff - offset_fsb); + return xfs_bmbt_to_iomap(ip, iomap, &cmap, IOMAP_F_SHARED); } - return xfs_bmbt_to_iomap(ip, iomap, &cmap, IOMAP_F_SHARED); + + xfs_trim_extent(&cmap, offset_fsb, imap.br_startoff - offset_fsb); + return xfs_bmbt_to_iomap(ip, iomap, &cmap, 0);
out_unlock: xfs_iunlock(ip, XFS_ILOCK_EXCL);
From: Dave Chinner dchinner@redhat.com
commit f38a032b165d812b0ba8378a5cd237c0888ff65f upstream.
Yup, the VFS hoist broke it, and nobody noticed. Bulkstat workloads make it clear that it doesn't work as it should.
Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer") Signed-off-by: Dave Chinner dchinner@redhat.com Reviewed-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Darrick J. Wong djwong@kernel.org Signed-off-by: Amir Goldstein amir73il@gmail.com Acked-by: Darrick J. Wong djwong@kernel.org --- fs/xfs/xfs_icache.c | 3 ++- fs/xfs/xfs_iops.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index deb99300d171..e69a08ed7de4 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -47,8 +47,9 @@ xfs_inode_alloc( return NULL; }
- /* VFS doesn't initialise i_mode! */ + /* VFS doesn't initialise i_mode or i_state! */ VFS_I(ip)->i_mode = 0; + VFS_I(ip)->i_state = 0;
XFS_STATS_INC(mp, vn_active); ASSERT(atomic_read(&ip->i_pincount) == 0); diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index b7f7b31a77d5..6a3026e78a9b 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1328,7 +1328,7 @@ xfs_setup_inode( gfp_t gfp_mask;
inode->i_ino = ip->i_ino; - inode->i_state = I_NEW; + inode->i_state |= I_NEW;
inode_sb_list_add(inode); /* make the inode look hashed for the writeback code */
linux-stable-mirror@lists.linaro.org