Both f2fs and ext4 end up passing the ciphertext page to wbc_account_cgroup_owner(). At the moment, the ciphertext page appears to belong to no cgroup, so it is accounted to the root_mem_cgroup instead of whatever cgroup the original page was in.
It's hard to say how far back this is a bug. The crypto code shared between ext4 & f2fs was created in May 2015 with commit 0b81d0779072, but neither filesystem did anything with memcg_data before then. memcg writeback accounting was added to ext4 in July 2015 in commit 001e4a8775f6 and it wasn't added to f2fs until January 2018 (commit 578c647879f7).
I'm going with the ext4 commit since this is the first commit where there was a difference in behaviour between encrypted and unencrypted filesystems.
Fixes: 001e4a8775f6 ("ext4: implement cgroup writeback support")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/crypto/crypto.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
index e78be66bbf01..a4e76f96f291 100644
--- a/fs/crypto/crypto.c
+++ b/fs/crypto/crypto.c
@@ -205,6 +205,9 @@ struct page *fscrypt_encrypt_pagecache_blocks(struct page *page,
 	}
 	SetPagePrivate(ciphertext_page);
 	set_page_private(ciphertext_page, (unsigned long)page);
+#ifdef CONFIG_MEMCG
+	ciphertext_page->memcg_data = page->memcg_data;
+#endif
 	return ciphertext_page;
 }
 EXPORT_SYMBOL(fscrypt_encrypt_pagecache_blocks);
On Sun, Jan 29, 2023 at 12:18:51PM +0000, Matthew Wilcox (Oracle) wrote:
> Both f2fs and ext4 end up passing the ciphertext page to wbc_account_cgroup_owner(). At the moment, the ciphertext page appears to belong to no cgroup, so it is accounted to the root_mem_cgroup instead of whatever cgroup the original page was in.
> It's hard to say how far back this is a bug. The crypto code shared between ext4 & f2fs was created in May 2015 with commit 0b81d0779072, but neither filesystem did anything with memcg_data before then. memcg writeback accounting was added to ext4 in July 2015 in commit 001e4a8775f6 and it wasn't added to f2fs until January 2018 (commit 578c647879f7).
> I'm going with the ext4 commit since this is the first commit where there was a difference in behaviour between encrypted and unencrypted filesystems.
> Fixes: 001e4a8775f6 ("ext4: implement cgroup writeback support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  fs/crypto/crypto.c | 3 +++
>  1 file changed, 3 insertions(+)
What is the actual effect of this bug?
The bounce pages are short-lived, so surely it doesn't really matter what memory cgroup they get charged to?
I guess it's really more about the effect on cgroup writeback? And that's also the reason why this is a problem here but not e.g. in dm-crypt?
> diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
> index e78be66bbf01..a4e76f96f291 100644
> --- a/fs/crypto/crypto.c
> +++ b/fs/crypto/crypto.c
> @@ -205,6 +205,9 @@ struct page *fscrypt_encrypt_pagecache_blocks(struct page *page,
>  	}
>  	SetPagePrivate(ciphertext_page);
>  	set_page_private(ciphertext_page, (unsigned long)page);
> +#ifdef CONFIG_MEMCG
> +	ciphertext_page->memcg_data = page->memcg_data;
> +#endif
>  	return ciphertext_page;
>  }
Nothing outside mm/ and include/linux/memcontrol.h does anything with memcg_data directly. Are you sure this is the right thing to do here?
Also, this patch causes the following:
[ 16.192276] BUG: Bad page state in process kworker/u4:2 pfn:10798a
[ 16.192919] page:00000000332f5565 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10798a
[ 16.193848] memcg:ffff88810766c000
[ 16.194186] flags: 0x200000000000000(node=0|zone=2)
[ 16.194642] raw: 0200000000000000 0000000000000000 dead000000000122 0000000000000000
[ 16.195356] raw: 0000000000000000 0000000000000000 00000000ffffffff ffff88810766c000
[ 16.196061] page dumped because: page still charged to cgroup
[ 16.196599] CPU: 0 PID: 33 Comm: kworker/u4:2 Tainted: G T 6.2.0-rc5-00001-gf84eecbf5db1 #3
[ 16.197494] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.1-1-1 04/01/2014
[ 16.198343] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
[ 16.198899] Call Trace:
[ 16.199143] <TASK>
[ 16.199350] show_stack+0x47/0x56
[ 16.199670] dump_stack_lvl+0x55/0x72
[ 16.200019] dump_stack+0x14/0x18
[ 16.200345] bad_page.cold+0x5e/0x8a
[ 16.200685] free_page_is_bad_report+0x61/0x70
[ 16.201111] free_pcp_prepare+0x13f/0x290
[ 16.201486] free_unref_page+0x27/0x1f0
[ 16.201848] __free_pages+0xa0/0xc0
[ 16.202186] mempool_free_pages+0xd/0x20
[ 16.202556] mempool_free+0x28/0x90
[ 16.202889] fscrypt_free_bounce_page+0x26/0x40
[ 16.203322] ext4_finish_bio+0x1ed/0x240
[ 16.203690] ext4_release_io_end+0x4a/0x100
[ 16.204088] ext4_end_io_rsv_work+0xa8/0x1b0
[ 16.204492] process_one_work+0x27f/0x580
[ 16.204874] worker_thread+0x5a/0x3d0
[ 16.205229] ? process_one_work+0x580/0x580
[ 16.205621] kthread+0x102/0x130
[ 16.205929] ? kthread_exit+0x30/0x30
[ 16.206280] ret_from_fork+0x1f/0x30
[ 16.206620] </TASK>
On Sun, Jan 29, 2023 at 10:10:35AM -0800, Eric Biggers wrote:
> On Sun, Jan 29, 2023 at 12:18:51PM +0000, Matthew Wilcox (Oracle) wrote:
>> Both f2fs and ext4 end up passing the ciphertext page to wbc_account_cgroup_owner(). At the moment, the ciphertext page appears to belong to no cgroup, so it is accounted to the root_mem_cgroup instead of whatever cgroup the original page was in.
>> It's hard to say how far back this is a bug. The crypto code shared between ext4 & f2fs was created in May 2015 with commit 0b81d0779072, but neither filesystem did anything with memcg_data before then. memcg writeback accounting was added to ext4 in July 2015 in commit 001e4a8775f6 and it wasn't added to f2fs until January 2018 (commit 578c647879f7).
> What is the actual effect of this bug?
> The bounce pages are short-lived, so surely it doesn't really matter what memory cgroup they get charged to?
Ah, we don't want to charge the _memory_ of the bounce pages to the cgroup. We want to charge the _I/O_ to the cgroup.
Looking at the original commits, the effect will be that if you have an unencrypted filesystem, writeback will be throttled according to the cgroup's rules, but if you have an encrypted filesystem, it will escape the cgroup I/O limits.
> I guess it's really more about the effect on cgroup writeback? And that's also the reason why this is a problem here but not e.g. in dm-crypt?
I haven't looked at dm-crypt at all, but my assumption is that the fs charges the I/O of the pagecache page to the cgroup, and there's no need to do it again.
>> diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
>> index e78be66bbf01..a4e76f96f291 100644
>> --- a/fs/crypto/crypto.c
>> +++ b/fs/crypto/crypto.c
>> @@ -205,6 +205,9 @@ struct page *fscrypt_encrypt_pagecache_blocks(struct page *page,
>>  	}
>>  	SetPagePrivate(ciphertext_page);
>>  	set_page_private(ciphertext_page, (unsigned long)page);
>> +#ifdef CONFIG_MEMCG
>> +	ciphertext_page->memcg_data = page->memcg_data;
>> +#endif
>>  	return ciphertext_page;
>>  }
> Nothing outside mm/ and include/linux/memcontrol.h does anything with memcg_data directly. Are you sure this is the right thing to do here?
Nope ;-) Happy to hear from people who know more about cgroups than I do. Adding some more ccs.
> Also, this patch causes the following:
Oops. Clearly memcg_data needs to be set to NULL before we free it.
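For readers following along, the failure mode can be modelled in userspace. This is only a sketch under stated assumptions: `struct model_page` and the `model_*` helpers are hypothetical stand-ins, not kernel APIs, and the real free path checks more than this one field. It shows why a bounce page that borrowed `memcg_data` trips the "page still charged to cgroup" check unless the field is cleared before the page goes back to the mempool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the field discussed here. */
struct model_page {
	unsigned long memcg_data;	/* 0 means "not charged to any cgroup" */
};

/* Models the free-time sanity check: a page that still carries memcg
 * state when it is freed is reported as a bad page. */
static bool model_free_page_is_bad(const struct model_page *page)
{
	assert(page != NULL);
	return page->memcg_data != 0;
}

/* What the posted patch does: copy the owner from the pagecache page. */
static void model_copy_owner(struct model_page *bounce,
			     const struct model_page *pagecache)
{
	bounce->memcg_data = pagecache->memcg_data;
}

/* The step the patch lacks: drop the borrowed owner before freeing. */
static void model_clear_owner(struct model_page *bounce)
{
	bounce->memcg_data = 0;
}
```

Whether clearing the field is sufficient in the kernel is not established here; the later replies argue for not touching memcg_data from fscrypt at all.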
Hello,
On Sun, Jan 29, 2023 at 09:26:57PM +0000, Matthew Wilcox wrote:
>>> diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
>>> index e78be66bbf01..a4e76f96f291 100644
>>> --- a/fs/crypto/crypto.c
>>> +++ b/fs/crypto/crypto.c
>>> @@ -205,6 +205,9 @@ struct page *fscrypt_encrypt_pagecache_blocks(struct page *page,
>>>  	}
>>>  	SetPagePrivate(ciphertext_page);
>>>  	set_page_private(ciphertext_page, (unsigned long)page);
>>> +#ifdef CONFIG_MEMCG
>>> +	ciphertext_page->memcg_data = page->memcg_data;
>>> +#endif
>>>  	return ciphertext_page;
>>>  }
>> Nothing outside mm/ and include/linux/memcontrol.h does anything with memcg_data directly. Are you sure this is the right thing to do here?
> Nope ;-) Happy to hear from people who know more about cgroups than I do. Adding some more ccs.
>> Also, this patch causes the following:
> Oops. Clearly memcg_data needs to be set to NULL before we free it.
These can usually be handled by explicitly associating the bio's to the desired cgroups using one of bio_associate_blkg*() or bio_clone_blkg_association(). It is possible to go through memcg ownership too using set_active_memcg() so that the page is owned by the target cgroup; however, the page ownership doesn't directly map to IO ownership as the relationship depends on the type of the page (e.g. IO ownership for pagecache writeback is determined per-inode, not per-page). If the in-flight pages are limited, it probably is better to set bio association directly.
Thanks.
On Tue, Jan 31, 2023 at 11:27:44AM -1000, Tejun Heo wrote:
> Hello,
> On Sun, Jan 29, 2023 at 09:26:57PM +0000, Matthew Wilcox wrote:
>>>> diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
>>>> index e78be66bbf01..a4e76f96f291 100644
>>>> --- a/fs/crypto/crypto.c
>>>> +++ b/fs/crypto/crypto.c
>>>> @@ -205,6 +205,9 @@ struct page *fscrypt_encrypt_pagecache_blocks(struct page *page,
>>>>  	}
>>>>  	SetPagePrivate(ciphertext_page);
>>>>  	set_page_private(ciphertext_page, (unsigned long)page);
>>>> +#ifdef CONFIG_MEMCG
>>>> +	ciphertext_page->memcg_data = page->memcg_data;
>>>> +#endif
>>>>  	return ciphertext_page;
>>>>  }
>>> Nothing outside mm/ and include/linux/memcontrol.h does anything with memcg_data directly. Are you sure this is the right thing to do here?
>> Nope ;-) Happy to hear from people who know more about cgroups than I do. Adding some more ccs.
>>> Also, this patch causes the following:
>> Oops. Clearly memcg_data needs to be set to NULL before we free it.
> These can usually be handled by explicitly associating the bio's to the desired cgroups using one of bio_associate_blkg*() or bio_clone_blkg_association().
Here that already happens in wbc_init_bio(), called from io_submit_init_bio() in fs/ext4/page-io.c.
> It is possible to go through memcg ownership too using set_active_memcg() so that the page is owned by the target cgroup; however, the page ownership doesn't directly map to IO ownership as the relationship depends on the type of the page (e.g. IO ownership for pagecache writeback is determined per-inode, not per-page). If the in-flight pages are limited, it probably is better to set bio association directly.
ext4 also calls wbc_account_cgroup_owner() for each pagecache page that's written out. It seems this is for a different purpose -- it looks like the fs-writeback code is trying to figure out which cgroup "owns" the inode based on which cgroup "owns" most of the pagecache pages?
The bug we're discussing here is that when ext4 writes out a pagecache page in an encrypted file, it first encrypts the data into a bounce page, then passes the bounce page (which doesn't have a memcg) to wbc_account_cgroup_owner(). Maybe the proper fix is to just pass the pagecache page to wbc_account_cgroup_owner() instead? See below for ext4 (a separate patch would be needed for f2fs):
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index beaec6d81074a..1e4db96a04e63 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -409,7 +409,8 @@ static void io_submit_init_bio(struct ext4_io_submit *io,
 
 static void io_submit_add_bh(struct ext4_io_submit *io,
 			     struct inode *inode,
-			     struct page *page,
+			     struct page *pagecache_page,
+			     struct page *bounce_page,
 			     struct buffer_head *bh)
 {
 	int ret;
@@ -421,10 +422,11 @@ static void io_submit_add_bh(struct ext4_io_submit *io,
 	}
 	if (io->io_bio == NULL)
 		io_submit_init_bio(io, bh);
-	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
+	ret = bio_add_page(io->io_bio, bounce_page ?: pagecache_page,
+			   bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
 		goto submit_and_retry;
-	wbc_account_cgroup_owner(io->io_wbc, page, bh->b_size);
+	wbc_account_cgroup_owner(io->io_wbc, pagecache_page, bh->b_size);
 	io->io_next_block++;
 }
 
@@ -561,8 +563,7 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	do {
 		if (!buffer_async_write(bh))
 			continue;
-		io_submit_add_bh(io, inode,
-				 bounce_page ? bounce_page : page, bh);
+		io_submit_add_bh(io, inode, page, bounce_page, bh);
 	} while ((bh = bh->b_this_page) != head);
 unlock:
 	unlock_page(page);
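
The idea behind the proposed change can be sketched in plain userspace C: the data that goes to disk comes from the bounce page when one exists, while the cgroup accounting always looks at the pagecache page. The types and names below (`struct model_page`, `struct model_submit`, `model_add_bh`, `MODEL_ROOT_CGROUP`) are hypothetical and exist only for this illustration; they are not the ext4 structures.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ROOT_CGROUP 0	/* hypothetical id standing in for the root cgroup */

/* Hypothetical page model: just the owning cgroup id. */
struct model_page {
	int owner_cgroup;	/* MODEL_ROOT_CGROUP if the page has no owner */
};

/* Hypothetical submission state: which page feeds the I/O and which
 * cgroup the written bytes were accounted to. */
struct model_submit {
	const struct model_page *io_page;
	int accounted_cgroup;
};

/* Mirrors the split in the patch: I/O uses the bounce page if present,
 * accounting always uses the pagecache page. */
static void model_add_bh(struct model_submit *io,
			 const struct model_page *pagecache_page,
			 const struct model_page *bounce_page)
{
	assert(pagecache_page != NULL);
	io->io_page = bounce_page ? bounce_page : pagecache_page;
	io->accounted_cgroup = pagecache_page->owner_cgroup;
}
```

With this split, an encrypted write and an unencrypted write of the same pagecache page are accounted to the same cgroup, which is the behaviour the original report says was missing.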
Hello,
On Tue, Jan 31, 2023 at 10:31:31PM -0800, Eric Biggers wrote:
>> These can usually be handled by explicitly associating the bio's to the desired cgroups using one of bio_associate_blkg*() or bio_clone_blkg_association().
> Here that already happens in wbc_init_bio(), called from io_submit_init_bio() in fs/ext4/page-io.c.
Yeah, without bouncing, that's usually how writeback IOs are associated with their cgroups.
>> It is possible to go through memcg ownership too using set_active_memcg() so that the page is owned by the target cgroup; however, the page ownership doesn't directly map to IO ownership as the relationship depends on the type of the page (e.g. IO ownership for pagecache writeback is determined per-inode, not per-page). If the in-flight pages are limited, it probably is better to set bio association directly.
> ext4 also calls wbc_account_cgroup_owner() for each pagecache page that's written out. It seems this is for a different purpose -- it looks like the fs-writeback code is trying to figure out which cgroup "owns" the inode based on which cgroup "owns" most of the pagecache pages?
Yeah, there's a difference between how memory and IO track cgroup ownership. Memory ownership is per-page but IO ownership is per-inode. This is because splitting writeback IOs of the same inode can perform really badly, so we try to find the majority dirty page owner cgroup of a given inode and associate the whole inode to that cgroup.
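One way to find a majority owner in a single pass is a byte-weighted Boyer-Moore majority vote. The sketch below is a simplified userspace model of that technique, assuming integer cgroup ids; `struct owner_vote` and the `owner_vote_*` helpers are hypothetical names, and this thread does not confirm how closely the kernel's own bookkeeping matches it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-writeback tally; not the kernel's writeback_control. */
struct owner_vote {
	int cand_id;		/* current majority candidate cgroup id */
	size_t cand_bytes;	/* vote weight still held by the candidate */
};

static void owner_vote_init(struct owner_vote *v)
{
	assert(v != NULL);
	v->cand_id = -1;	/* no candidate yet */
	v->cand_bytes = 0;
}

/* Account 'bytes' written on behalf of 'cgroup_id'. Matching votes add
 * weight to the candidate, other votes cancel weight; when the weight
 * reaches zero the next voter becomes the new candidate. */
static void owner_vote_account(struct owner_vote *v, int cgroup_id, size_t bytes)
{
	if (v->cand_bytes == 0)
		v->cand_id = cgroup_id;

	if (cgroup_id == v->cand_id)
		v->cand_bytes += bytes;
	else
		v->cand_bytes -= (bytes < v->cand_bytes) ? bytes : v->cand_bytes;
}
```

If one cgroup dirtied more than half of the bytes, it ends up as `cand_id`. The point for this thread is that pages reported with no owner vote for the root cgroup, which skews the result for encrypted files.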
So, something like md / dm, which gets a bio from filesystem and then bounces it to another bio, would use either bio_clone_blkg_association() to copy the association of the original bio (which probably is set through wbc_init_bio()) or determine the cgroup the bio should belong to somehow and set it explicitly with bio_associate_blkg(). However, here, as the filesystem is the one bouncing I guess it can be simpler.
> The bug we're discussing here is that when ext4 writes out a pagecache page in an encrypted file, it first encrypts the data into a bounce page, then passes the bounce page (which doesn't have a memcg) to wbc_account_cgroup_owner(). Maybe the proper fix is to just pass the pagecache page to wbc_account_cgroup_owner() instead? See below for ext4 (a separate patch would be needed for f2fs):
Yeah, this makes sense to me and is the right thing to do no matter what. wbc_account_cgroup_owner() should be fed the origin page so that the IO can be blamed on the owner of that page.
Thanks.
On Thu, Feb 02, 2023 at 11:30:42AM -1000, Tejun Heo wrote:
>> The bug we're discussing here is that when ext4 writes out a pagecache page in an encrypted file, it first encrypts the data into a bounce page, then passes the bounce page (which doesn't have a memcg) to wbc_account_cgroup_owner(). Maybe the proper fix is to just pass the pagecache page to wbc_account_cgroup_owner() instead? See below for ext4 (a separate patch would be needed for f2fs):
> Yeah, this makes sense to me and is the right thing to do no matter what. wbc_account_cgroup_owner() should be fed the origin page so that the IO can be blamed on the owner of that page.
> Thanks.
These patches fix this for ext4 and f2fs:
* https://lore.kernel.org/r/20230203005503.141557-1-ebiggers@kernel.org
* https://lore.kernel.org/r/20230203010239.216421-1-ebiggers@kernel.org
- Eric
Greeting,
FYI, we noticed BUG:Bad_page_state_in_process due to commit (built with gcc-11):
commit: fc581f48adffe8e6e2f1ae7822b004b0240602b3 ("[PATCH] fscrypt: Copy the memcg information to the ciphertext page")
url: https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox-Oracle/fscrypt...
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git c96618275234ad03d44eafe9f8844305bb44fda4
patch link: https://lore.kernel.org/all/20230129121851.2248378-1-willy@infradead.org/
patch subject: [PATCH] fscrypt: Copy the memcg information to the ciphertext page
in testcase: xfstests
version: xfstests-x86_64-fb6575e-1_20230123
with following parameters:

	disk: 4HDD
	fs: f2fs
	test: generic-group-22

test-description: xfstests is a regression test suite for xfs and other filesystems.
test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (Skylake) with 32G memory
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
[ 63.714777][ T1373] run fstests generic/440 at 2023-02-03 01:47:05
[ 66.322355][ T1973] F2FS-fs (sda4): Found nat_bits in checkpoint
[ 66.920249][ T1973] F2FS-fs (sda4): Mounted with checkpoint version = 194d8365
[ 66.952346][ T1983] xfs_io (pid 1983) is setting deprecated v1 encryption policy; recommend upgrading to v2.
[ 68.956618][ T2010] F2FS-fs (sda4): Found nat_bits in checkpoint
[ 69.578430][ T2010] F2FS-fs (sda4): Mounted with checkpoint version = 62a6fc2b
[ 69.824624][ T2111] fscrypt: AES-256-CTS-CBC using implementation "cts-cbc-aes-aesni"
[ 69.851641][ T1712] fscrypt: AES-256-XTS using implementation "xts-aes-aesni"
[ 70.337764][ T2125] F2FS-fs (sda4): Found nat_bits in checkpoint
[ 70.927766][ T2125] F2FS-fs (sda4): Mounted with checkpoint version = 62a6fc2e
[ 71.349898][ T2167] BUG: Bad page state in process 440 pfn:1803ec
[ 71.356070][ T2167] page:00000000a7ddf13f refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1803ec
[ 71.366105][ T2167] memcg:ffff8881f31f8000
[ 71.370178][ T2167] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[ 71.377365][ T2167] raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
[ 71.385759][ T2167] raw: 0000000000000000 0000000000000000 00000000ffffffff ffff8881f31f8000
[ 71.394149][ T2167] page dumped because: page still charged to cgroup
[ 71.400554][ T2167] Modules linked in: dm_mod f2fs crc32_generic ipmi_devintf ipmi_msghandler btrfs blake2b_generic xor raid6_pq zstd_compress intel_rapl_msr libcrc32c intel_rapl_common sd_mod t10_pi x86_pkg_temp_thermal intel_powerclamp crc64_rocksoft_generic crc64_rocksoft coretemp crc64 sg kvm_intel i915 kvm irqbypass crct10dif_pclmul drm_buddy crc32_pclmul intel_gtt crc32c_intel drm_display_helper ghash_clmulni_intel sha512_ssse3 ttm ahci mei_wdt rapl drm_kms_helper libahci intel_cstate wmi_bmof intel_uncore mei_me syscopyarea sysfillrect i2c_i801 video i2c_smbus mei libata sysimgblt intel_pch_thermal wmi intel_pmc_core acpi_pad drm fuse ip_tables
[ 71.457998][ T2167] CPU: 2 PID: 2167 Comm: 440 Tainted: G I 6.2.0-rc5-00206-gfc581f48adff #2
[ 71.467774][ T2167] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.1.1 10/07/2015
[ 71.475821][ T2167] Call Trace:
[ 71.478959][ T2167] <TASK>
[ 71.481754][ T2167] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
[ 71.486091][ T2167] bad_page.cold (mm/page_alloc.c:699)
[ 71.490341][ T2167] free_pcppages_bulk (mm/page_alloc.c:1598)
[ 71.495198][ T2167] free_unref_page (arch/x86/include/asm/paravirt.h:596 arch/x86/include/asm/qspinlock.h:57 include/linux/spinlock.h:203 include/linux/spinlock_api_smp.h:142 include/linux/spinlock.h:390 mm/page_alloc.c:3488)
[ 71.499792][ T2167] __mmdrop (arch/x86/include/asm/mmu_context.h:125 (discriminator 3) kernel/fork.c:796 (discriminator 3))
[ 71.503698][ T2167] finish_task_switch+0x486/0x720
[ 71.509157][ T2167] schedule_tail (arch/x86/include/asm/current.h:41 kernel/sched/core.c:5231)
[ 71.513408][ T2167] ret_from_fork (arch/x86/entry/entry_64.S:295)
[ 71.517573][ T2167] </TASK>
[ 71.520440][ T2167] Disabling lock debugging due to kernel taint
[ 71.668880][ T2169] F2FS-fs (sda4): Found nat_bits in checkpoint
[ 72.266606][ T2169] F2FS-fs (sda4): Mounted with checkpoint version = 62a6fc31
[ 74.907644][ T2213] F2FS-fs (sda4): Found nat_bits in checkpoint
[ 75.506786][ T2213] F2FS-fs (sda4): Mounted with checkpoint version = 62a6fc33
[ 75.713226][ T244] generic/440 _check_dmesg: something found in dmesg (see /lkp/benchmarks/xfstests/results//generic/440.dmesg)
Please note that this issue is not 100% reproducible in our tests. We got about 50% chance to reproduce the issue in multiple rounds of tests.
If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <yujie.liu@intel.com>
| Link: https://lore.kernel.org/oe-lkp/202302031333.e7a563c1-yujie.liu@intel.com
To reproduce:
        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.