July 2024 - Linux-stable-mirror

[PATCH v3] cxl: Fix possible null pointer dereference in read_handle()

by Ma Ke

In read_handle(), of_get_address() may return NULL which is later
dereferenced. Fix this by adding NULL check.

Based on our customized static analysis tool, extract vulnerability
features[1], then match similar vulnerability features in this function.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=2d9adecc88ab678785b581ab021f039372c324cb

Cc: stable(a)vger.kernel.org
Fixes: 14baf4d9c739 ("cxl: Add guest-specific code")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
Changes in v3:
- fixed up the changelog text as suggestions.
Changes in v2:
- added an explanation of how the potential vulnerability was
  discovered, but not meet the description specification requirements.
---
 drivers/misc/cxl/of.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/cxl/of.c b/drivers/misc/cxl/of.c
index bcc005dff1c0..d8dbb3723951 100644
--- a/drivers/misc/cxl/of.c
+++ b/drivers/misc/cxl/of.c
@@ -58,7 +58,7 @@ static int read_handle(struct device_node *np, u64 *handle)
 
 	/* Get address and size of the node */
 	prop = of_get_address(np, 0, &size, NULL);
-	if (size)
+	if (!prop || size)
 		return -EINVAL;
 
 	/* Helper to read a big number; size is in cells (not bytes) */
-- 
2.25.1
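
The check added by this patch can be modelled in userspace. The following is a minimal sketch, not the driver code: fake_get_address() is a hypothetical stand-in for of_get_address(), read_handle_model() is an invented name, and the cell combination ignores the big-endian handling that the real helper performs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for of_get_address(): hands back the caller's
 * cell array, which may be NULL when the node has no usable entry.
 */
static const uint32_t *fake_get_address(const uint32_t *reg, uint64_t *size)
{
	*size = 0;
	return reg;
}

/* Mirrors the patched check: reject a NULL property or a non-zero size. */
static int read_handle_model(const uint32_t *reg, uint64_t *handle)
{
	uint64_t size;
	const uint32_t *prop;

	assert(handle != NULL);

	prop = fake_get_address(reg, &size);
	if (!prop || size)
		return -EINVAL;

	/* Combine two cells; byte order is ignored in this sketch. */
	*handle = ((uint64_t)prop[0] << 32) | prop[1];
	return 0;
}
```

Without the !prop test, the NULL case would reach the line that reads prop[0].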

[PATCH v2 2/2] ext4: Testing lock class and subclass got the same name pointer

by botta633

From: Ahmed Ehab <bottaawesome633(a)gmail.com>

Checking if the lockdep_map->name will change when setting the subclass.
It shouldn't change so that the lock class and subclass will have the same
name

Reported-by: <syzbot+7f4a6f7f7051474e40ad(a)syzkaller.appspotmail.com>
Fixes: fd5e3f5fe27
Cc: <stable(a)vger.kernel.org>
Signed-off-by: botta633 <bottaawesome633(a)gmail.com>
---
 lib/locking-selftest.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index 6f6a5fc85b42..1d7885205f36 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -2710,12 +2710,24 @@ static void local_lock_3B(void)
 
 }
 
+static void class_subclass_X1_name(void)
+{
+	const char *name_before_subclass = rwsem_X1.dep_map.name;
+	const char *name_after_subclass;
+
+	WARN_ON(!rwsem_X1.dep_map.name);
+	lockdep_set_subclass(&rwsem_X1, 1);
+	WARN_ON(name_before_subclass != name_after_subclass);
+}
+
 static void local_lock_tests(void)
 {
 	printk(" --------------------------------------------------------------------------\n");
 	printk(" | local_lock tests |\n");
 	printk(" ---------------------\n");
 
+	init_class_X(&lock_X1, &rwlock_X1, &mutex_X1, &rwsem_X1);
+
 	print_testname("local_lock inversion 2");
 	dotest(local_lock_2, SUCCESS, LOCKTYPE_LL);
 	pr_cont("\n");
@@ -2727,6 +2739,10 @@ static void local_lock_tests(void)
 	print_testname("local_lock inversion 3B");
 	dotest(local_lock_3B, FAILURE, LOCKTYPE_LL);
 	pr_cont("\n");
+
+	print_testname("Class and subclass");
+	dotest(class_subclass_X1_name, SUCCESS, LOCKTYPE_RWSEM);
+	pr_cont("\n");
 }
 
 static void hardirq_deadlock_softirq_not_deadlock(void)
-- 
2.45.2
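
In class_subclass_X1_name() as posted, name_after_subclass is declared and compared but never assigned. The following is a minimal userspace sketch of what the check appears to intend, with the name read back after the subclass is set. struct fake_lockdep_map and fake_set_subclass() are hypothetical stand-ins and do not reproduce lockdep behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct lockdep_map. */
struct fake_lockdep_map {
	const char *name;
	unsigned int subclass;
};

/* Models a subclass update that leaves the name pointer untouched. */
static void fake_set_subclass(struct fake_lockdep_map *map, unsigned int sub)
{
	map->subclass = sub;
}

/* Returns true when the name pointer is identical before and after. */
static bool name_survives_subclass(struct fake_lockdep_map *map)
{
	const char *name_before = map->name;
	const char *name_after;

	assert(map->name);
	fake_set_subclass(map, 1);
	name_after = map->name;	/* read back before comparing */

	return name_before == name_after;
}
```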

[PATCH v2] media: ov5675: Fix power on/off delay timings

by Bryan O'Donoghue

The ov5675 specification says that the gap between XSHUTDN deassert and the
first I2C transaction should be a minimum of 8192 XVCLK cycles.

Right now we use a usleep_range() that gives a sleep time of between about
430 and 860 microseconds.

On the Lenovo X13s we have observed that in about 1/20 cases the current
timing is too tight and we start transacting before the ov5675's reset
cycle completes, leading to I2C bus transaction failures.

The reset racing is sometimes triggered at initial chip probe but, more
usually on a subsequent power-off/power-on cycle e.g.

[ 71.451662] ov5675 24-0010: failed to write reg 0x0103. error = -5
[ 71.451686] ov5675 24-0010: failed to set plls

The current quiescence period we have is too tight. Instead of expressing
the post reset delay in terms of the current XVCLK this patch converts the
power-on and power-off delays to the maximum theoretical delay @ 6 MHz with
an additional buffer.

1.365 milliseconds on the power-on path is 1.5 milliseconds with grace.
853 microseconds on the power-off path is 900 microseconds with grace.

Fixes: 49d9ad719e89 ("media: ov5675: add device-tree support and support runtime PM")
Cc: stable(a)vger.kernel.org
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
---
v2:
- Drop patch to read and act on reported XVCLK
- Use worst-case timings + a reasonable grace period in lieu of previous
  xvclk calculations on power-on and power-off.
- Link to v1: https://lore.kernel.org/r/20240711-linux-next-ov5675-v1-0-69e9b6c62c16@lina…

v1:
One long running saga for me on the Lenovo X13s is the occasional failure
to either probe or subsequently bring-up the ov5675 main RGB sensor on the
laptop.

Initially I suspected the PMIC for this part as the PMIC is using a new
interface on an I2C bus instead of an SPMI bus. In particular I thought
perhaps the I2C write to PMIC had completed but the regulator output hadn't
become stable from the perspective of the SoC. This however doesn't appear
to be the case - I can introduce a delay of milliseconds on the PMIC path
without resolving the sensor reset problem.

Secondly I thought about reset pin polarity or drive-strength but, again
playing about with both didn't yield decent results.

I also played with the duration of reset to no avail.

The error manifested as an I2C write timeout to the sensor which indicated
that the chip likely hadn't come out of reset. An intermittent fault
appearing in perhaps 1/10 or 1/20 reset cycles.

Looking at the expression of the reset we see that there is a minimum time
expressed in XVCLK cycles between reset completion and first I2C
transaction to the sensor. The specification calls out the minimum delay @
8192 XVCLK cycles and the ov5675 driver meets that timing almost exactly.

A little too exactly - testing finally showed that we were too racy with
respect to the minimum quiescence between reset completion and first
command to the chip.

Fixing this error I chose to base the fix again on the number of clocks
but to also support any clock rate the chip could support by moving away
from a define to reading and using the XVCLK.

True enough only 19.2 MHz is currently supported but for the hypothetical
case where some other frequency is supported in the future, I wanted the
fix introduced in this series to still hold.

Hence this series:

1. Allows for any clock rate to be used in the valid range for the reset.

2. Elongates the post-reset period based on clock cycles which can now
   vary.

Patch #2 can still be backported to stable irrespective of patch #1.
---
 drivers/media/i2c/ov5675.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/media/i2c/ov5675.c b/drivers/media/i2c/ov5675.c
index 3641911bc73f..547d6fab816a 100644
--- a/drivers/media/i2c/ov5675.c
+++ b/drivers/media/i2c/ov5675.c
@@ -972,12 +972,10 @@ static int ov5675_set_stream(struct v4l2_subdev *sd, int enable)
 
 static int ov5675_power_off(struct device *dev)
 {
-	/* 512 xvclk cycles after the last SCCB transation or MIPI frame end */
-	u32 delay_us = DIV_ROUND_UP(512, OV5675_XVCLK_19_2 / 1000 / 1000);
 	struct v4l2_subdev *sd = dev_get_drvdata(dev);
 	struct ov5675 *ov5675 = to_ov5675(sd);
 
-	usleep_range(delay_us, delay_us * 2);
+	usleep_range(900, 1000);
 
 	clk_disable_unprepare(ov5675->xvclk);
 	gpiod_set_value_cansleep(ov5675->reset_gpio, 1);
@@ -988,7 +986,6 @@ static int ov5675_power_off(struct device *dev)
 
 static int ov5675_power_on(struct device *dev)
 {
-	u32 delay_us = DIV_ROUND_UP(8192, OV5675_XVCLK_19_2 / 1000 / 1000);
 	struct v4l2_subdev *sd = dev_get_drvdata(dev);
 	struct ov5675 *ov5675 = to_ov5675(sd);
 	int ret;
@@ -1014,8 +1011,11 @@ static int ov5675_power_on(struct device *dev)
 
 	gpiod_set_value_cansleep(ov5675->reset_gpio, 0);
 
-	/* 8192 xvclk cycles prior to the first SCCB transation */
-	usleep_range(delay_us, delay_us * 2);
+	/* Worst case quiesence gap is 1.365 milliseconds @ 6MHz XVCLK
+	 * Add an additional threshold grace period to ensure reset
+	 * completion before initiating our first I2C transaction.
+	 */
+	usleep_range(1500, 1600);
 
 	return 0;
 }
---
base-commit: 523b23f0bee3014a7a752c9bb9f5c54f0eddae88
change-id: 20240710-linux-next-ov5675-60b0e83c73f1

Best regards,
-- 
Bryan O'Donoghue <bryan.odonoghue(a)linaro.org>
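
The cycle counts quoted in this patch convert to time as cycles divided by the clock rate. The following is a small userspace sketch of that arithmetic. The helper names are invented for illustration, and the 6 MHz figure is the worst-case XVCLK that the commit message uses.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds needed for `cycles` clock cycles at `clock_hz`, rounded up. */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t clock_hz)
{
	assert(clock_hz > 0);
	return (uint32_t)(((uint64_t)cycles * 1000000u + clock_hz - 1) / clock_hz);
}

/* Delay as the removed driver lines computed it: DIV_ROUND_UP(cycles, MHz). */
static uint32_t old_driver_delay_us(uint32_t cycles, uint32_t clock_hz)
{
	uint32_t mhz = clock_hz / 1000 / 1000;	/* integer division: 19.2 MHz becomes 19 */

	assert(mhz > 0);
	return (cycles + mhz - 1) / mhz;
}
```

With these helpers, 8192 cycles at 19.2 MHz need 427 microseconds, and the removed code slept for at least 432, which leaves almost no margin. At 6 MHz the same 8192 cycles need 1366 microseconds, below the new 1500 to 1600 microsecond sleep. By the same arithmetic, 512 cycles at 6 MHz come to about 86 microseconds, so the 900 to 1000 microsecond sleep on the power-off path is well above that.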

[PATCH] md/raid1: set max_sectors during early return from choose_slow_rdev()

by Mateusz Jończyk

Linux 6.9+ is unable to start a degraded RAID1 array with one drive,
when that drive has a write-mostly flag set. During such an attempt,
the following assertion in bio_split() is hit:

	BUG_ON(sectors <= 0);

Call Trace:
	? bio_split+0x96/0xb0
	? exc_invalid_op+0x53/0x70
	? bio_split+0x96/0xb0
	? asm_exc_invalid_op+0x1b/0x20
	? bio_split+0x96/0xb0
	? raid1_read_request+0x890/0xd20
	? __call_rcu_common.constprop.0+0x97/0x260
	raid1_make_request+0x81/0xce0
	? __get_random_u32_below+0x17/0x70
	? new_slab+0x2b3/0x580
	md_handle_request+0x77/0x210
	md_submit_bio+0x62/0xa0
	__submit_bio+0x17b/0x230
	submit_bio_noacct_nocheck+0x18e/0x3c0
	submit_bio_noacct+0x244/0x670

After investigation, it turned out that choose_slow_rdev() does not set
the value of max_sectors in some cases and because of it,
raid1_read_request calls bio_split with sectors == 0.

Fix it by filling in this variable.

This bug was introduced in
commit dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
but apparently hidden until
commit 0091c5a269ec ("md/raid1: factor out helpers to choose the best rdev from read_balance()")
shortly thereafter.

Cc: stable(a)vger.kernel.org # 6.9.x+
Signed-off-by: Mateusz Jończyk <mat.jonczyk(a)o2.pl>
Fixes: dfa8ecd167c1 ("md/raid1: factor out choose_slow_rdev() from read_balance()")
Cc: Song Liu <song(a)kernel.org>
Cc: Yu Kuai <yukuai3(a)huawei.com>
Cc: Paul Luse <paul.e.luse(a)linux.intel.com>
Cc: Xiao Ni <xni(a)redhat.com>
Cc: Mariusz Tkaczyk <mariusz.tkaczyk(a)linux.intel.com>
Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/

--

Tested on both Linux 6.10 and 6.9.8.

Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems:
	./test --dev=loop --no-error --raidtype=raid1
(on 6.9.8 there was one failure, caused by external bitmap support not
compiled in).

Notes:
- I was reliably getting deadlocks when adding / removing devices
  on such an array - while the array was loaded with fsstress with 20
  concurrent processes. When the array was idle or loaded with fsstress
  with 8 processes, no such deadlocks happened in my tests.
  This occurred also on unpatched Linux 6.8.0 though, but not on
  6.1.97-rc1, so this is likely an independent regression (to be
  investigated).
- I was also getting deadlocks when adding / removing the bitmap on the
  array in similar conditions - this happened on Linux 6.1.97-rc1
  also though. fsstress with 8 concurrent processes did cause it only
  once during many tests.
- in my testing, there was once a problem with hot adding an
  internal bitmap to the array:
	mdadm: Cannot add bitmap while array is resyncing or reshaping etc.
	mdadm: failed to set internal bitmap.
  even though no such reshaping was happening according to /proc/mdstat.
  This seems unrelated, though.
---
 drivers/md/raid1.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b8a71ca66dd..82f70a4ce6ed 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -680,6 +680,7 @@ static int choose_slow_rdev(struct r1conf *conf, struct r1bio *r1_bio,
 		len = r1_bio->sectors;
 		read_len = raid1_check_read_range(rdev, this_sector, &len);
 		if (read_len == r1_bio->sectors) {
+			*max_sectors = read_len;
 			update_read_sectors(conf, disk, this_sector, read_len);
 			return disk;
 		}

base-commit: 256abd8e550ce977b728be79a74e1729438b4948
-- 
2.25.1
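
The bug pattern here is an output parameter that one return path forgets to fill in. The following is a much-reduced userspace sketch of the fixed shape. struct fake_rdev and choose_disk_model() are hypothetical and do not reproduce the raid1 selection logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, much-reduced model of a disk candidate. */
struct fake_rdev {
	bool usable;
	int readable_sectors;	/* sectors readable before a bad block */
};

/*
 * Picks a disk and reports through *max_sectors how much may be read
 * from it. Returns the disk index, or -1 when nothing is usable.
 */
static int choose_disk_model(const struct fake_rdev *disks, int ndisks,
			     int wanted, int *max_sectors)
{
	int best = -1;
	int best_len = 0;

	assert(max_sectors);

	for (int i = 0; i < ndisks; i++) {
		int len;

		if (!disks[i].usable)
			continue;

		len = disks[i].readable_sectors < wanted ?
		      disks[i].readable_sectors : wanted;
		if (len == wanted) {
			/* The early return has to set the out-parameter too. */
			*max_sectors = len;
			return i;
		}
		if (len > best_len) {
			best_len = len;
			best = i;
		}
	}

	if (best != -1)
		*max_sectors = best_len;
	return best;
}
```

If the assignment in the early-return branch is dropped, a caller that initialised its variable to 0 would go on to split the request at 0 sectors, which is the condition the BUG_ON() in the trace catches.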

[to-be-updated] mm-huge_memory-avoid-pmd-size-page-cache-if-needed.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: mm/huge_memory: avoid PMD-size page cache if needed
has been removed from the -mm tree.  Its filename was
     mm-huge_memory-avoid-pmd-size-page-cache-if-needed.patch

This patch was dropped because an updated version will be issued

------------------------------------------------------
From: Gavin Shan <gshan(a)redhat.com>
Subject: mm/huge_memory: avoid PMD-size page cache if needed
Date: Thu, 11 Jul 2024 20:48:40 +1000

Currently, xarray can't support arbitrary page cache size and the largest
and supported page cache size is defined as MAX_PAGECACHE_ORDER in commit
099d90642a71 ("mm/filemap: make MAX_PAGECACHE_ORDER acceptable to
xarray").  However, it's possible to have 512MB page cache in the huge
memory collapsing path on ARM64 system whose base page size is 64KB.  A
warning is raised when the huge page cache is split as shown in the
following example.

[root@dhcp-10-26-1-207 ~]# cat /proc/1/smaps | grep KernelPageSize
KernelPageSize: 64 kB
[root@dhcp-10-26-1-207 ~]# cat /tmp/test.c
   :
int main(int argc, char **argv)
{
	const char *filename = TEST_XFS_FILENAME;
	int fd = 0;
	void *buf = (void *)-1, *p;
	int pgsize = getpagesize();
	int ret = 0;

	if (pgsize != 0x10000) {
		fprintf(stdout, "System with 64KB base page size is required!\n");
		return -EPERM;
	}

	system("echo 0 > /sys/devices/virtual/bdi/253:0/read_ahead_kb");
	system("echo 1 > /proc/sys/vm/drop_caches");

	/* Open xfs or shmem file */
	fd = open(filename, O_RDONLY);
	assert(fd > 0);

	/* Create VMA */
	buf = mmap(NULL, TEST_MEM_SIZE, PROT_READ, MAP_SHARED, fd, 0);
	assert(buf != (void *)-1);
	fprintf(stdout, "mapped buffer at 0x%p\n", buf);

	/* Populate VMA */
	ret = madvise(buf, TEST_MEM_SIZE, MADV_NOHUGEPAGE);
	assert(ret == 0);
	ret = madvise(buf, TEST_MEM_SIZE, MADV_POPULATE_READ);
	assert(ret == 0);

	/* Collapse VMA */
	ret = madvise(buf, TEST_MEM_SIZE, MADV_HUGEPAGE);
	assert(ret == 0);
	ret = madvise(buf, TEST_MEM_SIZE, MADV_COLLAPSE);
	if (ret) {
		fprintf(stdout, "Error %d to madvise(MADV_COLLAPSE)\n", errno);
		goto out;
	}

	/* Split xarray. The file needs to reopened with write permission */
	munmap(buf, TEST_MEM_SIZE);
	buf = (void *)-1;
	close(fd);
	fd = open(filename, O_RDWR);
	assert(fd > 0);
	fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
		  TEST_MEM_SIZE - pgsize, pgsize);
out:
	if (buf != (void *)-1)
		munmap(buf, TEST_MEM_SIZE);
	if (fd > 0)
		close(fd);

	return ret;
}

[root@dhcp-10-26-1-207 ~]# gcc /tmp/test.c -o /tmp/test
[root@dhcp-10-26-1-207 ~]# /tmp/test
 ------------[ cut here ]------------
 WARNING: CPU: 25 PID: 7560 at lib/xarray.c:1025 xas_split_alloc+0xf8/0x128
 Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib \
 nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct \
 nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 \
 ip_set rfkill nf_tables nfnetlink vfat fat virtio_balloon drm fuse \
 xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 virtio_net \
 sha1_ce net_failover virtio_blk virtio_console failover dimlib virtio_mmio
 CPU: 25 PID: 7560 Comm: test Kdump: loaded Not tainted 6.10.0-rc7-gavin+ #9
 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-1.el9 05/24/2024
 pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
 pc : xas_split_alloc+0xf8/0x128
 lr : split_huge_page_to_list_to_order+0x1c4/0x780
 sp : ffff8000ac32f660
 x29: ffff8000ac32f660 x28: ffff0000e0969eb0 x27: ffff8000ac32f6c0
 x26: 0000000000000c40 x25: ffff0000e0969eb0 x24: 000000000000000d
 x23: ffff8000ac32f6c0 x22: ffffffdfc0700000 x21: 0000000000000000
 x20: 0000000000000000 x19: ffffffdfc0700000 x18: 0000000000000000
 x17: 0000000000000000 x16: ffffd5f3708ffc70 x15: 0000000000000000
 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
 x11: ffffffffffffffc0 x10: 0000000000000040 x9 : ffffd5f3708e692c
 x8 : 0000000000000003 x7 : 0000000000000000 x6 : ffff0000e0969eb8
 x5 : ffffd5f37289e378 x4 : 0000000000000000 x3 : 0000000000000c40
 x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
 Call trace:
  xas_split_alloc+0xf8/0x128
  split_huge_page_to_list_to_order+0x1c4/0x780
  truncate_inode_partial_folio+0xdc/0x160
  truncate_inode_pages_range+0x1b4/0x4a8
  truncate_pagecache_range+0x84/0xa0
  xfs_flush_unmap_range+0x70/0x90 [xfs]
  xfs_file_fallocate+0xfc/0x4d8 [xfs]
  vfs_fallocate+0x124/0x2f0
  ksys_fallocate+0x4c/0xa0
  __arm64_sys_fallocate+0x24/0x38
  invoke_syscall.constprop.0+0x7c/0xd8
  do_el0_svc+0xb4/0xd0
  el0_svc+0x44/0x1d8
  el0t_64_sync_handler+0x134/0x150
  el0t_64_sync+0x17c/0x180

Fix it by avoiding PMD-sized page cache in the huge memory collapsing
path.  After this patch is applied, the test program fails with error
-EINVAL returned from __thp_vma_allowable_orders() and the madvise()
system call to collapse the page caches.

Link: https://lkml.kernel.org/r/20240711104840.200573-1-gshan@redhat.com
Fixes: 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache")
Signed-off-by: Gavin Shan <gshan(a)redhat.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: William Kucharski <william.kucharski(a)oracle.com>
Cc: <stable(a)vger.kernel.org> [5.17+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/huge_memory.c~mm-huge_memory-avoid-pmd-size-page-cache-if-needed
+++ a/mm/huge_memory.c
@@ -136,7 +136,8 @@ unsigned long __thp_vma_allowable_orders
 
 		while (orders) {
 			addr = vma->vm_end - (PAGE_SIZE << order);
-			if (thp_vma_suitable_order(vma, addr, order))
+			if (!(vma->vm_file && order > MAX_PAGECACHE_ORDER) &&
+			    thp_vma_suitable_order(vma, addr, order))
 				break;
 			order = next_order(&orders, order);
 		}
_

Patches currently in -mm which might be from gshan(a)redhat.com are
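
The sizes in this report translate to folio orders as follows. This is a userspace sketch only: the cap of order 11 in SKETCH_MAX_PAGECACHE_ORDER is an assumption made for illustration (the message only says that MAX_PAGECACHE_ORDER bounds what xarray supports), and both helper names are invented.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption for illustration: the xarray-imposed cap is order 11. */
#define SKETCH_MAX_PAGECACHE_ORDER 11

/* Smallest order n with (base_page_size << n) >= folio_size. */
static unsigned int size_to_order(unsigned long long folio_size,
				  unsigned long long base_page_size)
{
	unsigned int order = 0;

	assert(base_page_size > 0 && folio_size >= base_page_size);
	while ((base_page_size << order) < folio_size)
		order++;
	return order;
}

/* Mirrors the added condition: file-backed mappings may not exceed the cap. */
static bool order_allowed(bool file_backed, unsigned int order)
{
	return !(file_backed && order > SKETCH_MAX_PAGECACHE_ORDER);
}
```

A 512MB folio on 64KB base pages is order 13, which matches the value 0xd in registers x2 and x24 of the dump. Under the assumed cap, order_allowed(true, 13) is false, so a file-backed collapse to that size is refused, while anonymous memory is not affected by the check.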

[PATCH mm-unstable v1] mm/mglru: fix ineffective protection calculation

by Yu Zhao

mem_cgroup_calculate_protection() is not stateless and should only be
used as part of a top-down tree traversal. shrink_one() traverses the
per-node memcg LRU instead of the root_mem_cgroup tree, and therefore
it should not call mem_cgroup_calculate_protection().

The existing misuse in shrink_one() can cause ineffective protection
of sub-trees that are grandchildren of root_mem_cgroup. Fix it by
reusing lru_gen_age_node(), which already traverses the
root_mem_cgroup tree, to calculate the protection.

Previously lru_gen_age_node() opportunistically skips the first pass,
i.e., when scan_control->priority is DEF_PRIORITY. On the second pass,
lruvec_is_sizable() uses appropriate scan_control->priority, set by
set_initial_priority() from lru_gen_shrink_node(), to decide whether a
memcg is too small to reclaim from.

Now lru_gen_age_node() unconditionally traverses the root_mem_cgroup
tree. So it should call set_initial_priority() upfront, to make sure
lruvec_is_sizable() uses appropriate scan_control->priority on the
first pass. Otherwise, lruvec_is_reclaimable() can return false
negatives and result in premature OOM kills when min_ttl_ms is used.

Reported-by: T.J. Mercier <tjmercier(a)google.com>
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Cc: stable(a)vger.kernel.org
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
---
 mm/vmscan.c | 86 +++++++++++++++++++++++++----------------------------
 1 file changed, 40 insertions(+), 46 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6216d79edb7f..525d3ffa8451 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3915,6 +3915,32 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
  * working set protection
  ******************************************************************************/
 
+static void set_initial_priority(struct pglist_data *pgdat, struct scan_control *sc)
+{
+	int priority;
+	unsigned long reclaimable;
+
+	if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH)
+		return;
+	/*
+	 * Determine the initial priority based on
+	 * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim,
+	 * where reclaimed_to_scanned_ratio = inactive / total.
+	 */
+	reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE);
+	if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
+		reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);
+
+	/* round down reclaimable and round up sc->nr_to_reclaim */
+	priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);
+
+	/*
+	 * The estimation is based on LRU pages only, so cap it to prevent
+	 * overshoots of shrinker objects by large margins.
+	 */
+	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
+}
+
 static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
 {
 	int gen, type, zone;
@@ -3948,19 +3974,17 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
+	if (mem_cgroup_below_min(NULL, memcg))
+		return false;
+
+	if (!lruvec_is_sizable(lruvec, sc))
+		return false;
+
 	/* see the comment on lru_gen_folio */
 	gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
 	birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
 
-	if (time_is_after_jiffies(birth + min_ttl))
-		return false;
-
-	if (!lruvec_is_sizable(lruvec, sc))
-		return false;
-
-	mem_cgroup_calculate_protection(NULL, memcg);
-
-	return !mem_cgroup_below_min(NULL, memcg);
+	return time_is_before_jiffies(birth + min_ttl);
 }
 
 /* to protect the working set of the last N jiffies */
@@ -3970,23 +3994,20 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 	unsigned long min_ttl = READ_ONCE(lru_gen_min_ttl);
+	bool reclaimable = !min_ttl;
 
 	VM_WARN_ON_ONCE(!current_is_kswapd());
 
-	/* check the order to exclude compaction-induced reclaim */
-	if (!min_ttl || sc->order || sc->priority == DEF_PRIORITY)
-		return;
+	set_initial_priority(pgdat, sc);
 
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-		if (lruvec_is_reclaimable(lruvec, sc, min_ttl)) {
-			mem_cgroup_iter_break(NULL, memcg);
-			return;
-		}
+		mem_cgroup_calculate_protection(NULL, memcg);
 
-		cond_resched();
+		if (!reclaimable)
+			reclaimable = lruvec_is_reclaimable(lruvec, sc, min_ttl);
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 
 	/*
@@ -3994,7 +4015,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	 * younger than min_ttl. However, another possibility is all memcgs are
 	 * either too small or below min.
 	 */
-	if (mutex_trylock(&oom_lock)) {
+	if (!reclaimable && mutex_trylock(&oom_lock)) {
 		struct oom_control oc = {
 			.gfp_mask = sc->gfp_mask,
 		};
@@ -4786,8 +4807,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
-	mem_cgroup_calculate_protection(NULL, memcg);
-
+	/* lru_gen_age_node() called mem_cgroup_calculate_protection() */
 	if (mem_cgroup_below_min(NULL, memcg))
 		return MEMCG_LRU_YOUNG;
 
@@ -4911,32 +4931,6 @@ static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc
 	blk_finish_plug(&plug);
 }
 
-static void set_initial_priority(struct pglist_data *pgdat, struct scan_control *sc)
-{
-	int priority;
-	unsigned long reclaimable;
-
-	if (sc->priority != DEF_PRIORITY || sc->nr_to_reclaim < MIN_LRU_BATCH)
-		return;
-	/*
-	 * Determine the initial priority based on
-	 * (total >> priority) * reclaimed_to_scanned_ratio = nr_to_reclaim,
-	 * where reclaimed_to_scanned_ratio = inactive / total.
-	 */
-	reclaimable = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
-		reclaimable += node_page_state(pgdat, NR_INACTIVE_ANON);
-
-	/* round down reclaimable and round up sc->nr_to_reclaim */
-	priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);
-
-	/*
-	 * The estimation is based on LRU pages only, so cap it to prevent
-	 * overshoots of shrinker objects by large margins.
-	 */
-	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
-}
-
 static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
 	struct blk_plug plug;
-- 
2.45.2.993.g49e7a77208-goog

[merged mm-stable] mm-mglru-fix-overshooting-shrinker-memory.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: mm/mglru: fix overshooting shrinker memory
has been removed from the -mm tree.  Its filename was
     mm-mglru-fix-overshooting-shrinker-memory.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: fix overshooting shrinker memory
Date: Thu, 11 Jul 2024 13:19:57 -0600

set_initial_priority() tries to jump-start global reclaim by estimating
the priority based on cold/hot LRU pages.  The estimation does not account
for shrinker objects, and it cannot do so because their sizes can be in
different units other than page.

If shrinker objects are the majority, e.g., on TrueNAS SCALE 24.04.0 where
ZFS ARC can use almost all system memory, set_initial_priority() can
vastly underestimate how much memory ARC shrinker can evict and assign
extreme low values to scan_control->priority, resulting in overshoots of
shrinker objects.

To reproduce the problem, using TrueNAS SCALE 24.04.0 with 32GB DRAM, a
test ZFS pool and the following commands:

  fio --name=mglru.file --numjobs=36 --ioengine=io_uring \
      --directory=/root/test-zfs-pool/ --size=1024m --buffered=1 \
      --rw=randread --random_distribution=random \
      --time_based --runtime=1h &

  for ((i = 0; i < 20; i++))
  do
    sleep 120
    fio --name=mglru.anon --numjobs=16 --ioengine=mmap \
      --filename=/dev/zero --size=1024m --fadvise_hint=0 \
      --rw=randrw --random_distribution=random \
      --time_based --runtime=1m
  done

To fix the problem:
1. Cap scan_control->priority at or above DEF_PRIORITY/2, to prevent
   the jump-start from being overly aggressive.
2. Account for the progress from mm_account_reclaimed_pages(), to
   prevent kswapd_shrink_node() from raising the priority
   unnecessarily.

Link: https://lkml.kernel.org/r/20240711191957.939105-2-yuzhao@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: Alexander Motin <mav(a)ixsystems.com>
Cc: Wei Xu <weixugc(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/vmscan.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

--- a/mm/vmscan.c~mm-mglru-fix-overshooting-shrinker-memory
+++ a/mm/vmscan.c
@@ -4930,7 +4930,11 @@ static void set_initial_priority(struct
 	/* round down reclaimable and round up sc->nr_to_reclaim */
 	priority = fls_long(reclaimable) - 1 - fls_long(sc->nr_to_reclaim - 1);
 
-	sc->priority = clamp(priority, 0, DEF_PRIORITY);
+	/*
+	 * The estimation is based on LRU pages only, so cap it to prevent
+	 * overshoots of shrinker objects by large margins.
+	 */
+	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
 }
 
 static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *sc)
@@ -6754,6 +6758,7 @@ static bool kswapd_shrink_node(pg_data_t
 {
 	struct zone *zone;
 	int z;
+	unsigned long nr_reclaimed = sc->nr_reclaimed;
 
 	/* Reclaim a number of pages proportional to the number of zones */
 	sc->nr_to_reclaim = 0;
@@ -6781,7 +6786,8 @@ static bool kswapd_shrink_node(pg_data_t
 	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
 		sc->order = 0;
 
-	return sc->nr_scanned >= sc->nr_to_reclaim;
+	/* account for progress from mm_account_reclaimed_pages() */
+	return max(sc->nr_scanned, sc->nr_reclaimed - nr_reclaimed) >= sc->nr_to_reclaim;
 }
 
 /* Page allocator PCP high watermark is lowered if reclaim is active. */
_

Patches currently in -mm which might be from yuzhao(a)google.com are
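
The effect of raising the clamp floor from 0 to DEF_PRIORITY / 2 can be tried out in userspace. The sketch below re-implements the estimate with invented helper names. fls_long_model() is a stand-in for the kernel's fls_long(), and SKETCH_DEF_PRIORITY is assumed to be 12.

```c
#include <assert.h>

#define SKETCH_DEF_PRIORITY 12	/* assumed value of DEF_PRIORITY */

/* 1-based position of the highest set bit; 0 when x is 0. */
static int fls_long_model(unsigned long x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

static int clamp_int(int v, int lo, int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/* Initial priority estimate; `floor` is 0 before the patch, DEF_PRIORITY / 2 after. */
static int initial_priority(unsigned long reclaimable,
			    unsigned long nr_to_reclaim, int floor)
{
	int priority;

	assert(nr_to_reclaim > 0);

	/* round down reclaimable and round up nr_to_reclaim */
	priority = fls_long_model(reclaimable) - 1 -
		   fls_long_model(nr_to_reclaim - 1);

	return clamp_int(priority, floor, SKETCH_DEF_PRIORITY);
}
```

With only 4096 inactive LRU pages and a target of 2048, the raw estimate is priority 1. The old floor of 0 lets that through, while the new floor raises it to 6, which is the less aggressive starting point the commit message describes.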

[merged mm-stable] mm-mglru-fix-div-by-zero-in-vmpressure_calc_level.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: mm/mglru: fix div-by-zero in vmpressure_calc_level()
has been removed from the -mm tree.  Its filename was
     mm-mglru-fix-div-by-zero-in-vmpressure_calc_level.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm/mglru: fix div-by-zero in vmpressure_calc_level()
Date: Thu, 11 Jul 2024 13:19:56 -0600

evict_folios() uses a second pass to reclaim folios that have gone through
page writeback and become clean before it finishes the first pass, since
folio_rotate_reclaimable() cannot handle those folios due to the
isolation.

The second pass tries to avoid potential double counting by deducting
scan_control->nr_scanned.  However, this can result in underflow of
nr_scanned, under a condition where shrink_folio_list() does not increment
nr_scanned, i.e., when folio_trylock() fails.

The underflow can cause the divisor, i.e., scale=scanned+reclaimed in
vmpressure_calc_level(), to become zero, resulting in the following crash:

  [exception RIP: vmpressure_work_fn+101]
  process_one_work at ffffffffa3313f2b

Since scan_control->nr_scanned has no established semantics, the potential
double counting has minimal risks.  Therefore, fix the problem by not
deducting scan_control->nr_scanned in evict_folios().

Link: https://lkml.kernel.org/r/20240711191957.939105-1-yuzhao@google.com
Fixes: 359a5e1416ca ("mm: multi-gen LRU: retry folios written back while isolated")
Reported-by: Wei Xu <weixugc(a)google.com>
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Cc: Alexander Motin <mav(a)ixsystems.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/vmscan.c | 1 -
 1 file changed, 1 deletion(-)

--- a/mm/vmscan.c~mm-mglru-fix-div-by-zero-in-vmpressure_calc_level
+++ a/mm/vmscan.c
@@ -4597,7 +4597,6 @@ retry:
 
 		/* retry folios that may have missed folio_rotate_reclaimable() */
 		list_move(&folio->lru, &clean);
-		sc->nr_scanned -= folio_nr_pages(folio);
 	}
 
 	spin_lock_irq(&lruvec->lru_lock);
_

Patches currently in -mm which might be from yuzhao(a)google.com are
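
The underflow can be reproduced with plain unsigned arithmetic. The following is a userspace sketch with invented function names. Only the relation scale = scanned + reclaimed is taken from the description above, and the page counts are made-up illustration values.

```c
#include <assert.h>

/* Models scale = scanned + reclaimed, the divisor named in the commit message. */
static unsigned long pressure_scale(unsigned long scanned, unsigned long reclaimed)
{
	return scanned + reclaimed;
}

/* Models deducting retried pages from a scan counter that never counted them. */
static unsigned long deduct_scanned(unsigned long nr_scanned, unsigned long nr_pages)
{
	return nr_scanned - nr_pages;	/* unsigned arithmetic wraps below zero */
}

/* Walks through the failing sequence and returns the resulting divisor. */
static unsigned long underflow_demo(void)
{
	unsigned long nr_scanned = 0;	/* the first pass counted nothing */
	unsigned long nr_reclaimed = 4;	/* the retry pass reclaimed four pages */

	nr_scanned = deduct_scanned(nr_scanned, 4);
	assert(nr_scanned > nr_reclaimed);	/* wrapped to a huge value */

	return pressure_scale(nr_scanned, nr_reclaimed);
}
```

Here nr_scanned wraps to the largest values of unsigned long, and adding the four reclaimed pages wraps the sum back to exactly zero, which is the divisor that triggers the crash.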

[merged mm-stable] mm-hugetlb-fix-potential-race-with-try_memory_failure_hugetlb.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: mm/hugetlb: fix potential race with try_memory_failure_hugetlb()
has been removed from the -mm tree.  Its filename was
     mm-hugetlb-fix-potential-race-with-try_memory_failure_hugetlb.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Miaohe Lin <linmiaohe(a)huawei.com>
Subject: mm/hugetlb: fix potential race with try_memory_failure_hugetlb()
Date: Wed, 10 Jul 2024 16:14:45 +0800

There is a potential race between __update_and_free_hugetlb_folio() and
try_memory_failure_hugetlb():

 CPU1                                   CPU2
 __update_and_free_hugetlb_folio        try_memory_failure_hugetlb
                                         spin_lock_irq(&hugetlb_lock);
                                         __get_huge_page_for_hwpoison
                                          folio_test_hugetlb
                                          -- It's still hugetlb folio.
  folio_test_hugetlb_raw_hwp_unreliable
  -- raw_hwp_unreliable flag is not set yet.
                                          folio_set_hugetlb_hwpoison
                                          -- raw_hwp_unreliable flag might be set.
                                         spin_unlock_irq(&hugetlb_lock);
  spin_lock_irq(&hugetlb_lock);
  __folio_clear_hugetlb(folio);
   -- Hugetlb flag is cleared but too late!
  spin_unlock_irq(&hugetlb_lock);

When this race occurs, raw error pages will hit pcplists/buddy.  Fix this
issue by deferring folio_test_hugetlb_raw_hwp_unreliable() until
__folio_clear_hugetlb() is done.  The raw_hwp_unreliable flag cannot be
set after hugetlb folio flag is cleared.

Link: https://lkml.kernel.org/r/20240710081445.3307355-1-linmiaohe@huawei.com
Fixes: 32c877191e02 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
Signed-off-by: Miaohe Lin <linmiaohe(a)huawei.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/hugetlb.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlb-fix-potential-race-with-try_memory_failure_hugetlb
+++ a/mm/hugetlb.c
@@ -1706,13 +1706,6 @@ static void __update_and_free_hugetlb_fo
 		return;
 
 	/*
-	 * If we don't know which subpages are hwpoisoned, we can't free
-	 * the hugepage, so it's leaked intentionally.
-	 */
-	if (folio_test_hugetlb_raw_hwp_unreliable(folio))
-		return;
-
-	/*
 	 * If folio is not vmemmap optimized (!clear_flag), then the folio
 	 * is no longer identified as a hugetlb page. hugetlb_vmemmap_restore_folio
 	 * can only be passed hugetlb pages and will BUG otherwise.
@@ -1730,6 +1723,13 @@ static void __update_and_free_hugetlb_fo
 	}
 
 	/*
+	 * If we don't know which subpages are hwpoisoned, we can't free
+	 * the hugepage, so it's leaked intentionally.
+	 */
+	if (folio_test_hugetlb_raw_hwp_unreliable(folio))
+		return;
+
+	/*
 	 * Move PageHWPoison flag from head page to the raw error pages,
 	 * which makes any healthy subpages reusable.
 	 */
_

Patches currently in -mm which might be from linmiaohe(a)huawei.com are

mm-memory-failure-fix-vm_bug_on_pagepagepoisonedpage-when-unpoison-memory.patch
mm-hugetlb-fix-possible-recursive-locking-detected-warning.patch

11 months, 3 weeks

[merged mm-stable] mm-shmem-rename-mthp-shmem-counters.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: mm: shmem: rename mTHP shmem counters
has been removed from the -mm tree. Its filename was
     mm-shmem-rename-mthp-shmem-counters.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Ryan Roberts <ryan.roberts(a)arm.com>
Subject: mm: shmem: rename mTHP shmem counters
Date: Wed, 10 Jul 2024 10:55:01 +0100

The legacy PMD-sized THP counters at /proc/vmstat include thp_file_alloc,
thp_file_fallback and thp_file_fallback_charge, which rather confusingly
refer to shmem THP and do not include any other types of file pages. This
is inconsistent since in most other places in the kernel, THP counters are
explicitly separated for anon, shmem and file flavours. However, we are
stuck with it since it constitutes a user ABI.

Recently, commit 66f44583f9b6 ("mm: shmem: add mTHP counters for anonymous
shmem") added equivalent mTHP stats for shmem, keeping the same "file_"
prefix in the names. But in future, we may want to add extra stats to
cover actual file pages, at which point, it would all become very
confusing.

So let's take the opportunity to rename these new counters "shmem_" before
the change makes it upstream and the ABI becomes immutable. While we are
at it, let's improve the documentation for the legacy counters to make it
clear that they count shmem pages only.

Link: https://lkml.kernel.org/r/20240710095503.3193901-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts(a)arm.com>
Reviewed-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Reviewed-by: Lance Yang <ioworker0(a)gmail.com>
Reviewed-by: Zi Yan <ziy(a)nvidia.com>
Reviewed-by: Barry Song <baohua(a)kernel.org>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Daniel Gomez <da.gomez(a)samsung.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jonathan Corbet <corbet(a)lwn.net>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 Documentation/admin-guide/mm/transhuge.rst | 29 ++++++++++---------
 include/linux/huge_mm.h                    |  6 +--
 mm/huge_memory.c                           | 12 +++----
 mm/shmem.c                                 |  8 ++---
 4 files changed, 29 insertions(+), 26 deletions(-)

--- a/Documentation/admin-guide/mm/transhuge.rst~mm-shmem-rename-mthp-shmem-counters
+++ a/Documentation/admin-guide/mm/transhuge.rst
@@ -412,20 +412,23 @@ thp_collapse_alloc_failed
 	the allocation.
 
 thp_file_alloc
-	is incremented every time a file huge page is successfully
-	allocated.
+	is incremented every time a shmem huge page is successfully
+	allocated (Note that despite being named after "file", the counter
+	measures only shmem).
 
 thp_file_fallback
-	is incremented if a file huge page is attempted to be allocated
-	but fails and instead falls back to using small pages.
+	is incremented if a shmem huge page is attempted to be allocated
+	but fails and instead falls back to using small pages. (Note that
+	despite being named after "file", the counter measures only shmem).
 
 thp_file_fallback_charge
-	is incremented if a file huge page cannot be charged and instead
+	is incremented if a shmem huge page cannot be charged and instead
 	falls back to using small pages even though the allocation was
-	successful.
+	successful. (Note that despite being named after "file", the
+	counter measures only shmem).
 
 thp_file_mapped
-	is incremented every time a file huge page is mapped into
+	is incremented every time a file or shmem huge page is mapped into
 	user address space.
 
 thp_split_page
@@ -496,16 +499,16 @@ swpout_fallback
 	Usually because failed to allocate some continuous swap space
 	for the huge page.
 
-file_alloc
-	is incremented every time a file huge page is successfully
+shmem_alloc
+	is incremented every time a shmem huge page is successfully
 	allocated.
 
-file_fallback
-	is incremented if a file huge page is attempted to be allocated
+shmem_fallback
+	is incremented if a shmem huge page is attempted to be allocated
 	but fails and instead falls back to using small pages.
 
-file_fallback_charge
-	is incremented if a file huge page cannot be charged and instead
+shmem_fallback_charge
+	is incremented if a shmem huge page cannot be charged and instead
 	falls back to using small pages even though the allocation was
 	successful.
 
--- a/include/linux/huge_mm.h~mm-shmem-rename-mthp-shmem-counters
+++ a/include/linux/huge_mm.h
@@ -269,9 +269,9 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
 	MTHP_STAT_SWPOUT,
 	MTHP_STAT_SWPOUT_FALLBACK,
-	MTHP_STAT_FILE_ALLOC,
-	MTHP_STAT_FILE_FALLBACK,
-	MTHP_STAT_FILE_FALLBACK_CHARGE,
+	MTHP_STAT_SHMEM_ALLOC,
+	MTHP_STAT_SHMEM_FALLBACK,
+	MTHP_STAT_SHMEM_FALLBACK_CHARGE,
 	MTHP_STAT_SPLIT,
 	MTHP_STAT_SPLIT_FAILED,
 	MTHP_STAT_SPLIT_DEFERRED,
--- a/mm/huge_memory.c~mm-shmem-rename-mthp-shmem-counters
+++ a/mm/huge_memory.c
@@ -568,9 +568,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallbac
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
 DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
-DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
-DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
-DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(shmem_alloc, MTHP_STAT_SHMEM_ALLOC);
+DEFINE_MTHP_STAT_ATTR(shmem_fallback, MTHP_STAT_SHMEM_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(shmem_fallback_charge, MTHP_STAT_SHMEM_FALLBACK_CHARGE);
 DEFINE_MTHP_STAT_ATTR(split, MTHP_STAT_SPLIT);
 DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
 DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
@@ -581,9 +581,9 @@ static struct attribute *stats_attrs[] =
 	&anon_fault_fallback_charge_attr.attr,
 	&swpout_attr.attr,
 	&swpout_fallback_attr.attr,
-	&file_alloc_attr.attr,
-	&file_fallback_attr.attr,
-	&file_fallback_charge_attr.attr,
+	&shmem_alloc_attr.attr,
+	&shmem_fallback_attr.attr,
+	&shmem_fallback_charge_attr.attr,
 	&split_attr.attr,
 	&split_failed_attr.attr,
 	&split_deferred_attr.attr,
--- a/mm/shmem.c~mm-shmem-rename-mthp-shmem-counters
+++ a/mm/shmem.c
@@ -1777,7 +1777,7 @@ static struct folio *shmem_alloc_and_add
 			if (pages == HPAGE_PMD_NR)
 				count_vm_event(THP_FILE_FALLBACK);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-			count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
+			count_mthp_stat(order, MTHP_STAT_SHMEM_FALLBACK);
 #endif
 			order = next_order(&suitable_orders, order);
 		}
@@ -1804,8 +1804,8 @@ allocated:
 				count_vm_event(THP_FILE_FALLBACK_CHARGE);
 			}
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-			count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
-			count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE);
+			count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK);
+			count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_FALLBACK_CHARGE);
 #endif
 		}
 		goto unlock;
@@ -2181,7 +2181,7 @@ repeat:
 			if (folio_test_pmd_mappable(folio))
 				count_vm_event(THP_FILE_ALLOC);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-			count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
+			count_mthp_stat(folio_order(folio), MTHP_STAT_SHMEM_ALLOC);
 #endif
 			goto alloced;
 		}
_

Patches currently in -mm which might be from ryan.roberts(a)arm.com are
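
For anyone consuming these counters from user space, the rename means a reader may want to cope with both spellings. Below is a minimal sketch of that, not something taken from the patch: the helper names mthp_stat_path(), read_counter() and read_shmem_alloc() are hypothetical, and the sysfs location is assumed to be the per-size stats directory described in transhuge.rst (/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats). The "file_alloc" fallback only matters on a tree that carries commit 66f44583f9b6 without this rename.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of one per-size mTHP counter.
 * Returns 0 on success, -1 if the path does not fit in buf.
 * The directory layout is an assumption taken from transhuge.rst.
 */
static int mthp_stat_path(char *buf, size_t len, unsigned int size_kb,
			  const char *name)
{
	int n;

	assert(buf != NULL && name != NULL);
	n = snprintf(buf, len,
		     "/sys/kernel/mm/transparent_hugepage/hugepages-%ukB/stats/%s",
		     size_kb, name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read a single unsigned counter from path; 0 on success, -1 otherwise. */
static int read_counter(const char *path, unsigned long long *val)
{
	FILE *fp = fopen(path, "r");
	int ok;

	if (!fp)
		return -1;
	ok = (fscanf(fp, "%llu", val) == 1);
	fclose(fp);
	return ok ? 0 : -1;
}

/* Try the renamed counter first, then the pre-rename spelling. */
static int read_shmem_alloc(unsigned int size_kb, unsigned long long *val)
{
	static const char *const names[] = { "shmem_alloc", "file_alloc" };
	char path[256];
	size_t i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (mthp_stat_path(path, sizeof(path), size_kb, names[i]) == 0 &&
		    read_counter(path, val) == 0)
			return 0;
	}
	return -1;
}
```

A caller would pass the huge page size in kB, for example read_shmem_alloc(64, &count), and treat -1 as "counter not available on this kernel". The same two-name lookup would apply to shmem_fallback and shmem_fallback_charge.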

11 months, 3 weeks
