From: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
The call to invalidate_inode_pages2_range() in __iomap_dio_rw() may
fail, in which case -ENOTBLK is returned and this error code is
propagated back to user space through the iomap_dio_rw() ->
zonefs_file_dio_write() return chain. This error code is fairly obscure
and may confuse the user. Avoid this and be consistent with the behavior
of zonefs_file_dio_append() for similar invalidate_inode_pages2_range()
errors by returning -EBUSY to user space when iomap_dio_rw() returns
-ENOTBLK.
Suggested-by: Christoph Hellwig <hch(a)infradead.org>
Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Tested-by: Hans Holmberg <hans.holmberg(a)wdc.com>
(cherry picked from commit 77af13ba3c7f91d91c377c7e2d122849bbc17128)
Orabug: 35351356
Signed-off-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
Conflicts:
fs/zonefs/file.c
Small conflicts due to the old API for iomap_dio_rw().
---
fs/zonefs/file.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 49b934c98c5a..27cd84c5e5d0 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -583,11 +583,20 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
append = sync;
}
- if (append)
+ if (append) {
ret = zonefs_file_dio_append(iocb, from);
- else
+ } else {
+ /*
+ * iomap_dio_rw() may return ENOTBLK if there was an issue with
+ * page invalidation. Overwrite that error code with EBUSY to
+ * be consistent with zonefs_file_dio_append() return value for
+ * similar issues.
+ */
ret = iomap_dio_rw(iocb, from, &zonefs_write_iomap_ops,
&zonefs_write_dio_ops, 0, 0);
+ if (ret == -ENOTBLK)
+ ret = -EBUSY;
+ }
if (zonefs_zone_is_seq(z) &&
(ret > 0 || ret == -EIOCBQUEUED)) {
--
2.31.1
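As an illustration of what this error change means for applications, below is a minimal user-space sketch (not part of the patch) of a direct append write to a zonefs sequential zone file that treats EBUSY as a transient, retryable condition. The 4 KiB write size, the file path argument and the "retry later" policy are assumptions for the example only.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SIZE 4096	/* assumed zone write granularity */

int main(int argc, char **argv)
{
	void *buf;
	ssize_t ret;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <zonefs seq zone file>\n", argv[0]);
		return 1;
	}

	/* Sequential zone files are written with O_DIRECT, at EOF. */
	fd = open(argv[1], O_WRONLY | O_DIRECT);
	if (fd < 0 || lseek(fd, 0, SEEK_END) < 0) {
		perror(argv[1]);
		return 1;
	}

	if (posix_memalign(&buf, WRITE_SIZE, WRITE_SIZE))
		return 1;
	memset(buf, 0, WRITE_SIZE);

	ret = write(fd, buf, WRITE_SIZE);
	if (ret < 0 && errno == EBUSY)
		/* Transient page cache invalidation failure: retry later. */
		fprintf(stderr, "write returned EBUSY, retrying later\n");
	else if (ret < 0)
		perror("write");

	free(buf);
	close(fd);
	return ret != WRITE_SIZE;
}

With the patch applied, a page cache invalidation failure in the direct write path surfaces as the EBUSY branch above rather than the obscure ENOTBLK.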
From: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
When a direct append write is executed, the append offset may correspond
to the last page of a sequential file inode which might have been cached
already by buffered reads, page faults with mmap-read or non-direct
readahead. To ensure that the on-disk and cached data are consistent for
such a last cached page, make sure to always invalidate it in
zonefs_file_dio_append(). If the invalidation fails, return -EBUSY to
userspace to differentiate from IO errors.
This invalidation will always be a no-op when the FS block size (device
zone write granularity) is equal to the page size (e.g. 4K).
Reported-by: Hans Holmberg <Hans.Holmberg(a)wdc.com>
Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Tested-by: Hans Holmberg <hans.holmberg(a)wdc.com>
(cherry picked from commit c1976bd8f23016d8706973908f2bb0ac0d852a8f)
Orabug: 35351356
Signed-off-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
---
fs/zonefs/file.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 5708c54cda69..49b934c98c5a 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -382,6 +382,7 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
struct zonefs_zone *z = zonefs_inode_zone(inode);
struct block_device *bdev = inode->i_sb->s_bdev;
unsigned int max = bdev_max_zone_append_sectors(bdev);
+ pgoff_t start, end;
struct bio *bio;
ssize_t size = 0;
int nr_pages;
@@ -390,6 +391,19 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
iov_iter_truncate(from, max);
+ /*
+ * If the inode block size (zone write granularity) is smaller than the
+ * page size, we may be appending data belonging to the last page of the
+ * inode straddling inode->i_size, with that page already cached due to
+ * a buffered read or readahead. So make sure to invalidate that page.
+ * This will always be a no-op for the case where the block size is
+ * equal to the page size.
+ */
+ start = iocb->ki_pos >> PAGE_SHIFT;
+ end = (iocb->ki_pos + iov_iter_count(from) - 1) >> PAGE_SHIFT;
+ if (invalidate_inode_pages2_range(inode->i_mapping, start, end))
+ return -EBUSY;
+
nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
if (!nr_pages)
return 0;
--
2.31.1
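To make the invalidated range concrete, here is a small stand-alone sketch (not part of the patch) of the start/end page-index computation added above. It assumes a 64K page kernel with a 4K zone write granularity, the case where the invalidation is not a no-op; the offsets chosen are arbitrary.

#include <stdio.h>

#define PAGE_SHIFT	16	/* assume a 64K page kernel */
#define BLOCK_SIZE	4096ULL	/* assume 4K zone write granularity */

int main(void)
{
	/* Append at a block-aligned i_size that falls inside page 0. */
	unsigned long long ki_pos = 9 * BLOCK_SIZE;	/* 36 KiB */
	unsigned long long count = BLOCK_SIZE;		/* one block appended */
	unsigned long long start = ki_pos >> PAGE_SHIFT;
	unsigned long long end = (ki_pos + count - 1) >> PAGE_SHIFT;

	/*
	 * Here start == end == 0: the single 64K page straddling i_size is
	 * invalidated because a buffered read or readahead may already have
	 * cached it.  With block size == page size, i_size is always
	 * page-aligned, so the range never covers a cached page and the
	 * call is a no-op, as the patch description notes.
	 */
	printf("invalidate page cache indexes %llu..%llu\n", start, end);
	return 0;
}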
From: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Since the expected write location in a sequential file is always at the
end of the file (append write), when an invalid write append location is
detected in zonefs_file_dio_append(), print the invalid written location
instead of the expected write location.
Fixes: a608da3bd730 ("zonefs: Detect append writes at invalid locations")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
(cherry picked from commit 88b170088ad2c3e27086fe35769aa49f8a512564)
Orabug: 35351356
Signed-off-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
---
fs/zonefs/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 82f608d57a84..5708c54cda69 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -428,7 +428,7 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
if (bio->bi_iter.bi_sector != wpsector) {
zonefs_warn(inode->i_sb,
"Corrupted write pointer %llu for zone at %llu\n",
- wpsector, z->z_sector);
+ bio->bi_iter.bi_sector, z->z_sector);
ret = -EIO;
}
}
--
2.31.1
From: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
If a file zone transitions to the offline or readonly state from an
active state, we must clear the zone active flag and decrement the
active seq file counter. Do so in zonefs_account_active() using the new
zonefs inode flags ZONEFS_ZONE_OFFLINE and ZONEFS_ZONE_READONLY. These
flags are set if necessary in zonefs_check_zone_condition() based on the
result of the report zones operation executed after an IO error.
Fixes: 87c9ce3ffec9 ("zonefs: Add active seq file accounting")
Cc: stable(a)vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal(a)opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
(cherry picked from commit db58653ce0c7cf4d155727852607106f890005c0)
Orabug: 35351356
Signed-off-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
---
fs/zonefs/super.c | 11 +++++++++++
fs/zonefs/zonefs.h | 6 ++++--
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 355f2da12374..0552c103fcd8 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -40,6 +40,13 @@ static void zonefs_account_active(struct inode *inode)
if (zi->i_ztype != ZONEFS_ZTYPE_SEQ)
return;
+ /*
+ * For zones that transitioned to the offline or readonly condition,
+ * we only need to clear the active state.
+ */
+ if (zi->i_flags & (ZONEFS_ZONE_OFFLINE | ZONEFS_ZONE_READONLY))
+ goto out;
+
/*
* If the zone is active, that is, if it is explicitly open or
* partially written, check if it was already accounted as active.
@@ -53,6 +60,7 @@ static void zonefs_account_active(struct inode *inode)
return;
}
+out:
/* The zone is not active. If it was, update the active count */
if (zi->i_flags & ZONEFS_ZONE_ACTIVE) {
zi->i_flags &= ~ZONEFS_ZONE_ACTIVE;
@@ -324,6 +332,7 @@ static loff_t zonefs_check_zone_condition(struct inode *inode,
inode->i_flags |= S_IMMUTABLE;
inode->i_mode &= ~0777;
zone->wp = zone->start;
+ zi->i_flags |= ZONEFS_ZONE_OFFLINE;
return 0;
case BLK_ZONE_COND_READONLY:
/*
@@ -342,8 +351,10 @@ static loff_t zonefs_check_zone_condition(struct inode *inode,
zone->cond = BLK_ZONE_COND_OFFLINE;
inode->i_mode &= ~0777;
zone->wp = zone->start;
+ zi->i_flags |= ZONEFS_ZONE_OFFLINE;
return 0;
}
+ zi->i_flags |= ZONEFS_ZONE_READONLY;
inode->i_mode &= ~0222;
return i_size_read(inode);
case BLK_ZONE_COND_FULL:
diff --git a/fs/zonefs/zonefs.h b/fs/zonefs/zonefs.h
index 4b3de66c3233..1dbe78119ff1 100644
--- a/fs/zonefs/zonefs.h
+++ b/fs/zonefs/zonefs.h
@@ -39,8 +39,10 @@ static inline enum zonefs_ztype zonefs_zone_type(struct blk_zone *zone)
return ZONEFS_ZTYPE_SEQ;
}
-#define ZONEFS_ZONE_OPEN (1 << 0)
-#define ZONEFS_ZONE_ACTIVE (1 << 1)
+#define ZONEFS_ZONE_OPEN (1U << 0)
+#define ZONEFS_ZONE_ACTIVE (1U << 1)
+#define ZONEFS_ZONE_OFFLINE (1U << 2)
+#define ZONEFS_ZONE_READONLY (1U << 3)
/*
* In-memory inode data.
--
2.31.1
The patch titled
Subject: maple_tree: make maple state reusable after mas_empty_area()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
maple_tree-make-maple-state-reusable-after-mas_empty_area.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Peng Zhang <zhangpeng.00(a)bytedance.com>
Subject: maple_tree: make maple state reusable after mas_empty_area()
Date: Fri, 5 May 2023 22:58:29 +0800
Make mas->min and mas->max point to a node range instead of a leaf entry
range. This allows mas to still be usable after mas_empty_area() returns.
Users would get unexpected results from other operations on the maple
state after calling the affected function.
For example, x86 MAP_32BIT mmap() acts as if there is no suitable gap when
there should be one.
Link: https://lkml.kernel.org/r/20230505145829.74574-1-zhangpeng.00@bytedance.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Peng Zhang <zhangpeng.00(a)bytedance.com>
Reported-by: "Edgecombe, Rick P" <rick.p.edgecombe(a)intel.com>
Reported-by: Tad <support(a)spotco.us>
Reported-by: Michael Keyes <mgkeyes(a)vigovproductions.net>
Link: https://lore.kernel.org/linux-mm/32f156ba80010fd97dbaf0a0cdfc84366608624d.c…
Link: https://lore.kernel.org/linux-mm/e6108286ac025c268964a7ead3aab9899f9bc6e9.c…
Reviewed-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe(a)intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/maple_tree.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
--- a/lib/maple_tree.c~maple_tree-make-maple-state-reusable-after-mas_empty_area
+++ a/lib/maple_tree.c
@@ -5317,15 +5317,9 @@ int mas_empty_area(struct ma_state *mas,
mt = mte_node_type(mas->node);
pivots = ma_pivots(mas_mn(mas), mt);
- if (offset)
- mas->min = pivots[offset - 1] + 1;
-
- if (offset < mt_pivots[mt])
- mas->max = pivots[offset];
-
- if (mas->index < mas->min)
- mas->index = mas->min;
-
+ min = mas_safe_min(mas, pivots, offset);
+ if (mas->index < min)
+ mas->index = min;
mas->last = mas->index + size - 1;
return 0;
}
_
Patches currently in -mm which might be from zhangpeng.00(a)bytedance.com are
maple_tree-make-maple-state-reusable-after-mas_empty_area.patch
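For illustration, a hedged kernel-style sketch (not part of the patch) of the kind of caller this fix is about, loosely modeled on the mas_empty_area() + mas_next() gap-search pattern used for mmap; find_gap_then_peek() is a made-up helper, and locking (the tree spinlock or an external lock) is assumed to be handled by the caller.

#include <linux/kernel.h>
#include <linux/maple_tree.h>

/*
 * Illustration only (hypothetical helper): locate a gap of @size and then
 * keep using the same maple state.  Before this fix, mas_empty_area()
 * clamped mas->min/mas->max to the leaf slot, so a follow-up operation
 * such as mas_next() could walk from a bogus position; with the fix the
 * state remains coherent.
 */
static unsigned long find_gap_then_peek(struct maple_tree *mt,
					unsigned long size)
{
	MA_STATE(mas, mt, 0, 0);
	unsigned long gap;
	void *next;

	if (mas_empty_area(&mas, 0, ULONG_MAX, size))
		return -ENOMEM;

	/* mas.index/mas.last now describe a free range of @size. */
	gap = mas.index;

	/* The state is still usable: peek at the next in-tree entry. */
	next = mas_next(&mas, ULONG_MAX);
	pr_debug("gap of %lu at %lu, next entry %s\n",
		 size, gap, next ? "present" : "none");

	return gap;
}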
The patch titled
Subject: zsmalloc: move LRU update from zs_map_object() to zs_malloc()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
zsmalloc-move-lru-update-from-zs_map_object-to-zs_malloc.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Nhat Pham <nphamcs(a)gmail.com>
Subject: zsmalloc: move LRU update from zs_map_object() to zs_malloc()
Date: Fri, 5 May 2023 11:50:54 -0700
Under memory pressure, we sometimes observe the following crash:
[ 5694.832838] ------------[ cut here ]------------
[ 5694.842093] list_del corruption, ffff888014b6a448->next is LIST_POISON1 (dead000000000100)
[ 5694.858677] WARNING: CPU: 33 PID: 418824 at lib/list_debug.c:47 __list_del_entry_valid+0x42/0x80
[ 5694.961820] CPU: 33 PID: 418824 Comm: fuse_counters.s Kdump: loaded Tainted: G S 5.19.0-0_fbk3_rc3_hoangnhatpzsdynshrv41_10870_g85a9558a25de #1
[ 5694.990194] Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM16 05/24/2021
[ 5695.007072] RIP: 0010:__list_del_entry_valid+0x42/0x80
[ 5695.017351] Code: 08 48 83 c2 22 48 39 d0 74 24 48 8b 10 48 39 f2 75 2c 48 8b 51 08 b0 01 48 39 f2 75 34 c3 48 c7 c7 55 d7 78 82 e8 4e 45 3b 00 <0f> 0b eb 31 48 c7 c7 27 a8 70 82 e8 3e 45 3b 00 0f 0b eb 21 48 c7
[ 5695.054919] RSP: 0018:ffffc90027aef4f0 EFLAGS: 00010246
[ 5695.065366] RAX: 41fe484987275300 RBX: ffff888008988180 RCX: 0000000000000000
[ 5695.079636] RDX: ffff88886006c280 RSI: ffff888860060480 RDI: ffff888860060480
[ 5695.093904] RBP: 0000000000000002 R08: 0000000000000000 R09: ffffc90027aef370
[ 5695.108175] R10: 0000000000000000 R11: ffffffff82fdf1c0 R12: 0000000010000002
[ 5695.122447] R13: ffff888014b6a448 R14: ffff888014b6a420 R15: 00000000138dc240
[ 5695.136717] FS: 00007f23a7d3f740(0000) GS:ffff888860040000(0000) knlGS:0000000000000000
[ 5695.152899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5695.164388] CR2: 0000560ceaab6ac0 CR3: 000000001c06c001 CR4: 00000000007706e0
[ 5695.178659] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5695.192927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5695.207197] PKRU: 55555554
[ 5695.212602] Call Trace:
[ 5695.217486] <TASK>
[ 5695.221674] zs_map_object+0x91/0x270
[ 5695.229000] zswap_frontswap_store+0x33d/0x870
[ 5695.237885] ? do_raw_spin_lock+0x5d/0xa0
[ 5695.245899] __frontswap_store+0x51/0xb0
[ 5695.253742] swap_writepage+0x3c/0x60
[ 5695.261063] shrink_page_list+0x738/0x1230
[ 5695.269255] shrink_lruvec+0x5ec/0xcd0
[ 5695.276749] ? shrink_slab+0x187/0x5f0
[ 5695.284240] ? mem_cgroup_iter+0x6e/0x120
[ 5695.292255] shrink_node+0x293/0x7b0
[ 5695.299402] do_try_to_free_pages+0xea/0x550
[ 5695.307940] try_to_free_pages+0x19a/0x490
[ 5695.316126] __folio_alloc+0x19ff/0x3e40
[ 5695.323971] ? __filemap_get_folio+0x8a/0x4e0
[ 5695.332681] ? walk_component+0x2a8/0xb50
[ 5695.340697] ? generic_permission+0xda/0x2a0
[ 5695.349231] ? __filemap_get_folio+0x8a/0x4e0
[ 5695.357940] ? walk_component+0x2a8/0xb50
[ 5695.365955] vma_alloc_folio+0x10e/0x570
[ 5695.373796] ? walk_component+0x52/0xb50
[ 5695.381634] wp_page_copy+0x38c/0xc10
[ 5695.388953] ? filename_lookup+0x378/0xbc0
[ 5695.397140] handle_mm_fault+0x87f/0x1800
[ 5695.405157] do_user_addr_fault+0x1bd/0x570
[ 5695.413520] exc_page_fault+0x5d/0x110
[ 5695.421017] asm_exc_page_fault+0x22/0x30
After some investigation, I have found the following issue: unlike other
zswap backends, zsmalloc performs the LRU list update at the object
mapping time, rather than when the slot for the object is allocated.
This deviation was discussed and agreed upon during the review process
of the zsmalloc writeback patch series:
https://lore.kernel.org/lkml/Y3flcAXNxxrvy3ZH@cmpxchg.org/
Unfortunately, this introduces a subtle bug that occurs when there is a
concurrent store and reclaim, which interleave as follows:
zswap_frontswap_store()            shrink_worker()
  zs_malloc()                        zs_zpool_shrink()
    spin_lock(&pool->lock)             zs_reclaim_page()
    zspage = find_get_zspage()
    spin_unlock(&pool->lock)
                                         spin_lock(&pool->lock)
                                         zspage = list_first_entry(&pool->lru)
                                         list_del(&zspage->lru)
                                           zspage->lru.next = LIST_POISON1
                                           zspage->lru.prev = LIST_POISON2
                                         spin_unlock(&pool->lock)
  zs_map_object()
    spin_lock(&pool->lock)
    if (!list_empty(&zspage->lru))
      list_del(&zspage->lru)
        CHECK_DATA_CORRUPTION(next == LIST_POISON1) /* BOOM */
With the current upstream code, this issue rarely happens. zswap only
triggers writeback when the pool is already full, at which point all
further store attempts are short-circuited. This creates an implicit
pseudo-serialization between reclaim and store. I am working on a new
zswap shrinking mechanism, which makes interleaving reclaim and store
more likely, exposing this bug.
zbud and z3fold do not have this problem, because they perform the LRU
list update in the alloc function, while still holding the pool's lock.
This patch fixes the aforementioned bug by moving the LRU update back to
zs_malloc(), analogous to zbud and z3fold.
Link: https://lkml.kernel.org/r/20230505185054.2417128-1-nphamcs@gmail.com
Fixes: 64f768c6b32e ("zsmalloc: add a LRU to zs_pool to keep track of zspages in LRU order")
Signed-off-by: Nhat Pham <nphamcs(a)gmail.com>
Suggested-by: Johannes Weiner <hannes(a)cmpxchg.org>
Acked-by: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Dan Streetman <ddstreet(a)ieee.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Nitin Gupta <ngupta(a)vflare.org>
Cc: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: Seth Jennings <sjenning(a)redhat.com>
Cc: Vitaly Wool <vitaly.wool(a)konsulko.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/zsmalloc.c | 36 +++++++++---------------------------
1 file changed, 9 insertions(+), 27 deletions(-)
--- a/mm/zsmalloc.c~zsmalloc-move-lru-update-from-zs_map_object-to-zs_malloc
+++ a/mm/zsmalloc.c
@@ -1331,31 +1331,6 @@ void *zs_map_object(struct zs_pool *pool
obj_to_location(obj, &page, &obj_idx);
zspage = get_zspage(page);
-#ifdef CONFIG_ZPOOL
- /*
- * Move the zspage to front of pool's LRU.
- *
- * Note that this is swap-specific, so by definition there are no ongoing
- * accesses to the memory while the page is swapped out that would make
- * it "hot". A new entry is hot, then ages to the tail until it gets either
- * written back or swaps back in.
- *
- * Furthermore, map is also called during writeback. We must not put an
- * isolated page on the LRU mid-reclaim.
- *
- * As a result, only update the LRU when the page is mapped for write
- * when it's first instantiated.
- *
- * This is a deviation from the other backends, which perform this update
- * in the allocation function (zbud_alloc, z3fold_alloc).
- */
- if (mm == ZS_MM_WO) {
- if (!list_empty(&zspage->lru))
- list_del(&zspage->lru);
- list_add(&zspage->lru, &pool->lru);
- }
-#endif
-
/*
* migration cannot move any zpages in this zspage. Here, pool->lock
* is too heavy since callers would take some time until they calls
@@ -1525,9 +1500,8 @@ unsigned long zs_malloc(struct zs_pool *
fix_fullness_group(class, zspage);
record_obj(handle, obj);
class_stat_inc(class, ZS_OBJS_INUSE, 1);
- spin_unlock(&pool->lock);
- return handle;
+ goto out;
}
spin_unlock(&pool->lock);
@@ -1550,6 +1524,14 @@ unsigned long zs_malloc(struct zs_pool *
/* We completely set up zspage so mark them as movable */
SetZsPageMovable(pool, zspage);
+out:
+#ifdef CONFIG_ZPOOL
+ /* Add/move zspage to beginning of LRU */
+ if (!list_empty(&zspage->lru))
+ list_del(&zspage->lru);
+ list_add(&zspage->lru, &pool->lru);
+#endif
+
spin_unlock(&pool->lock);
return handle;
_
Patches currently in -mm which might be from nphamcs(a)gmail.com are
zsmalloc-move-lru-update-from-zs_map_object-to-zs_malloc.patch
workingset-refactor-lru-refault-to-expose-refault-recency-check.patch
cachestat-implement-cachestat-syscall.patch
selftests-add-selftests-for-cachestat.patch
The self-refresh helper framework overloads "disable" to sometimes mean
"go into self-refresh mode," and this mode activates automatically
(e.g., after some period of unchanging display output). In such cases,
the display pipe is still considered "on", and user-space is not aware
that we went into self-refresh mode. Thus, users may expect that
vblank-related features (such as DRM_IOCTL_WAIT_VBLANK) still work
properly.
However, disable_outputs() currently triggers its WARN_ONCE() if a CRTC
driver tries to leave vblank enabled.
Add a different expectation: that CRTCs *should* leave vblank enabled
when going into self-refresh.
This patch is preparation for another patch -- "drm/rockchip: vop: Leave
vblank enabled in self-refresh" -- which resolves conflicts between the
above self-refresh behavior and the API tests in IGT's kms_vblank test
module.
== Some alternatives discussed: ==
It's likely that on many display controllers, vblank interrupts will
turn off when the CRTC is disabled, and so in some cases, self-refresh
may not support vblank. To support such cases, we might consider
additions to the generic helpers such that we fire vblank events based
on a timer.
However, there is currently only one driver using the common
self-refresh helpers (i.e., rockchip), and at least as of commit
bed030a49f3e ("drm/rockchip: Don't fully disable vop on self refresh"),
the CRTC hardware is powered enough to continue to generate vblank
interrupts.
So we chose the simpler option of leaving vblank interrupts enabled. We
can reevaluate this decision and perhaps augment the helpers if/when we
gain a second driver that has different requirements.
v3:
* include discussion summary
v2:
* add 'ret != 0' warning case for self-refresh
* describe failing test case and relation to drm/rockchip patch better
Cc: <stable(a)vger.kernel.org> # dependency for "drm/rockchip: vop: Leave
# vblank enabled in self-refresh"
Signed-off-by: Brian Norris <briannorris(a)chromium.org>
---
drivers/gpu/drm/drm_atomic_helper.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index d579fd8f7cb8..a22485e3e924 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1209,7 +1209,16 @@ disable_outputs(struct drm_device *dev, struct drm_atomic_state *old_state)
continue;
ret = drm_crtc_vblank_get(crtc);
- WARN_ONCE(ret != -EINVAL, "driver forgot to call drm_crtc_vblank_off()\n");
+ /*
+ * Self-refresh is not a true "disable"; ensure vblank remains
+ * enabled.
+ */
+ if (new_crtc_state->self_refresh_active)
+ WARN_ONCE(ret != 0,
+ "driver disabled vblank in self-refresh\n");
+ else
+ WARN_ONCE(ret != -EINVAL,
+ "driver forgot to call drm_crtc_vblank_off()\n");
if (ret == 0)
drm_crtc_vblank_put(crtc);
}
--
2.39.0.314.g84b9a713c41-goog
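For reference, a hedged sketch (not part of the patch) of how a CRTC driver's ->atomic_disable() hook could satisfy the new expectation; the foo_*() hardware helpers are hypothetical and only approximate the follow-up rockchip change referenced above.

#include <drm/drm_atomic.h>
#include <drm/drm_crtc.h>
#include <drm/drm_vblank.h>

/* Hypothetical hardware helpers, assumed to exist in the driver. */
static void foo_enter_self_refresh(struct drm_crtc *crtc);
static void foo_power_down_pipe(struct drm_crtc *crtc);

static void foo_crtc_atomic_disable(struct drm_crtc *crtc,
				    struct drm_atomic_state *state)
{
	struct drm_crtc_state *old_state =
		drm_atomic_get_old_crtc_state(state, crtc);

	if (old_state->self_refresh_active) {
		/*
		 * Not a real disable: keep the pipe and its vblank
		 * interrupts running so the WARN added above stays quiet
		 * and DRM_IOCTL_WAIT_VBLANK keeps working.
		 */
		foo_enter_self_refresh(crtc);
		return;
	}

	drm_crtc_vblank_off(crtc);
	foo_power_down_pipe(crtc);
}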
Make mas->min and mas->max point to a node range instead of a leaf entry
range. This allows mas to still be usable after mas_empty_area() returns.
Users would get unexpected results from other operations on the maple
state after calling the affected function.
Reported-by: "Edgecombe, Rick P" <rick.p.edgecombe(a)intel.com>
Reported-by: Tad <support(a)spotco.us>
Reported-by: Michael Keyes <mgkeyes(a)vigovproductions.net>
Link: https://lore.kernel.org/linux-mm/32f156ba80010fd97dbaf0a0cdfc84366608624d.c…
Link: https://lore.kernel.org/linux-mm/e6108286ac025c268964a7ead3aab9899f9bc6e9.c…
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Peng Zhang <zhangpeng.00(a)bytedance.com>
---
lib/maple_tree.c | 12 +++---------
1 file changed, 3 insertions(+), 9 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 110a36479dced..8ebc43d4cc8c5 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5317,15 +5317,9 @@ int mas_empty_area(struct ma_state *mas, unsigned long min,
mt = mte_node_type(mas->node);
pivots = ma_pivots(mas_mn(mas), mt);
- if (offset)
- mas->min = pivots[offset - 1] + 1;
-
- if (offset < mt_pivots[mt])
- mas->max = pivots[offset];
-
- if (mas->index < mas->min)
- mas->index = mas->min;
-
+ min = mas_safe_min(mas, pivots, offset);
+ if (mas->index < min)
+ mas->index = min;
mas->last = mas->index + size - 1;
return 0;
}
--
2.20.1
Do not update the min and max of the maple state to the slot of the leaf
node. Leaving min and max at the node range allows the maple state to be
used in other operations.
Users would get unexpected results from other operations on the maple
state after calling the affected function.
Reported-by: "Edgecombe, Rick P" <rick.p.edgecombe(a)intel.com>
Reported-by: Tad <support(a)spotco.us>
Reported-by: Michael Keyes <mgkeyes(a)vigovproductions.net>
Link: https://lore.kernel.org/linux-mm/32f156ba80010fd97dbaf0a0cdfc84366608624d.c…
Link: https://lore.kernel.org/linux-mm/e6108286ac025c268964a7ead3aab9899f9bc6e9.c…
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
---
lib/maple_tree.c | 15 +--------------
1 file changed, 1 insertion(+), 14 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 110a36479dced..1c4bc7a988ed3 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5285,10 +5285,6 @@ static inline int mas_sparse_area(struct ma_state *mas, unsigned long min,
int mas_empty_area(struct ma_state *mas, unsigned long min,
unsigned long max, unsigned long size)
{
- unsigned char offset;
- unsigned long *pivots;
- enum maple_type mt;
-
if (min >= max)
return -EINVAL;
@@ -5311,18 +5307,9 @@ int mas_empty_area(struct ma_state *mas, unsigned long min,
if (unlikely(mas_is_err(mas)))
return xa_err(mas->node);
- offset = mas->offset;
- if (unlikely(offset == MAPLE_NODE_SLOTS))
+ if (unlikely(mas->offset == MAPLE_NODE_SLOTS))
return -EBUSY;
- mt = mte_node_type(mas->node);
- pivots = ma_pivots(mas_mn(mas), mt);
- if (offset)
- mas->min = pivots[offset - 1] + 1;
-
- if (offset < mt_pivots[mt])
- mas->max = pivots[offset];
-
if (mas->index < mas->min)
mas->index = mas->min;
--
2.39.2