If device_add() fails, do not use device_unregister() for error
handling. device_unregister() consists two functions: device_del() and
put_device(). device_unregister() should only be called after
device_add() succeeded because device_del() undoes what device_add()
does if successful. Change device_unregister() to put_device() call
before returning from the function.
As comment of device_add() says, 'if device_add() succeeds, you should
call device_del() when you want to get rid of it. If device_add() has
not succeeded, use only put_device() to drop the reference count'.
Found by code review.
Cc: stable(a)vger.kernel.org
Fixes: 53d2a715c240 ("phy: Add Tegra XUSB pad controller support")
Signed-off-by: Ma Ke <make24(a)iscas.ac.cn>
---
Changes in v2:
- modified the bug description as suggestions.
---
drivers/phy/tegra/xusb.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/phy/tegra/xusb.c b/drivers/phy/tegra/xusb.c
index 79d4814d758d..c89df95aa6ca 100644
--- a/drivers/phy/tegra/xusb.c
+++ b/drivers/phy/tegra/xusb.c
@@ -548,16 +548,16 @@ static int tegra_xusb_port_init(struct tegra_xusb_port *port,
err = dev_set_name(&port->dev, "%s-%u", name, index);
if (err < 0)
- goto unregister;
+ goto put_device;
err = device_add(&port->dev);
if (err < 0)
- goto unregister;
+ goto put_device;
return 0;
-unregister:
- device_unregister(&port->dev);
+put_device:
+ put_device(&port->dev);
return err;
}
--
2.25.1
The patch titled
Subject: mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled.
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-vmscan-pgdemote-vmstat-is-not-getting-updated-when-mglru-is-enabled.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Donet Tom <donettom(a)linux.ibm.com>
Subject: mm: vmscan : pgdemote vmstat is not getting updated when MGLRU is enabled.
Date: Thu, 9 Jan 2025 00:05:39 -0600
When MGLRU is enabled, the pgdemote_kswapd, pgdemote_direct, and
pgdemote_khugepaged stats in vmstat are not being updated.
Commit f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA
balancing operations") moved the pgdemote vmstat update from
demote_folio_list() to shrink_inactive_list(), which is in the normal LRU
path. As a result, the pgdemote stats are updated correctly for the
normal LRU but not for MGLRU.
To address this, we have added the pgdemote stat update in the
evict_folios() function, which is in the MGLRU path. With this patch, the
pgdemote stats will now be updated correctly when MGLRU is enabled.
Without this patch vmstat output when MGLRU is enabled
======================================================
pgdemote_kswapd 0
pgdemote_direct 0
pgdemote_khugepaged 0
With this patch vmstat output when MGLRU is enabled
===================================================
pgdemote_kswapd 43234
pgdemote_direct 4691
pgdemote_khugepaged 0
Link: https://lkml.kernel.org/r/20250109060540.451261-1-donettom@linux.ibm.com
Fixes: f77f0c751478 ("mm,memcg: provide per-cgroup counters for NUMA balancing operations")
Signed-off-by: Donet Tom <donettom(a)linux.ibm.com>
Acked-by: Yu Zhao <yuzhao(a)google.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar(a)kernel.org>
Cc: David Rientjes <rientjes(a)google.com>
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Kaiyang Zhao <kaiyang2(a)cs.cmu.edu>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: Ritesh Harjani (IBM) <ritesh.list(a)gmail.com>
Cc: Roman Gushchin <roman.gushchin(a)linux.dev>
Cc: Shakeel Butt <shakeel.butt(a)linux.dev>
Cc: Wei Xu <weixugc(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/vmscan.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/vmscan.c~mm-vmscan-pgdemote-vmstat-is-not-getting-updated-when-mglru-is-enabled
+++ a/mm/vmscan.c
@@ -4646,6 +4646,9 @@ retry:
reset_batch_size(walk);
}
+ __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
+ stat.nr_demoted);
+
item = PGSTEAL_KSWAPD + reclaimer_offset();
if (!cgroup_reclaim(sc))
__count_vm_events(item, reclaimed);
_
Patches currently in -mm which might be from donettom(a)linux.ibm.com are
mm-vmscan-pgdemote-vmstat-is-not-getting-updated-when-mglru-is-enabled.patch
mm-migrate-removed-unused-argument-vma-from-migrate_misplaced_folio.patch
selftests-mm-added-new-test-cases-to-the-migration-test.patch
[BUG]
With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
with the following VM_BUG_ON_FOLIO():
BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2992!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : folio_clear_dirty_for_io+0x128/0x258
lr : folio_clear_dirty_for_io+0x128/0x258
Call trace:
folio_clear_dirty_for_io+0x128/0x258
btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
__process_folios_contig+0x154/0x268 [btrfs]
extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
run_delalloc_nocow+0x5f8/0x760 [btrfs]
btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
writepage_delalloc+0x230/0x4c8 [btrfs]
extent_writepage+0xb8/0x358 [btrfs]
extent_write_cache_pages+0x21c/0x4e8 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x178/0x3a8 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
The first two lines of extra debug messages show the problem is caused
by the error handling of run_delalloc_nocow().
E.g. we have the following dirtied range (4K blocksize 4K page size):
0 16K 32K
|//////////////////////////////////////|
| Pre-allocated |
And the range [0, 16K) has a preallocated extent.
- Enter run_delalloc_nocow() for range [0, 16K)
Which found range [0, 16K) is preallocated, can do the proper NOCOW
write.
- Enter fallback_to_fow() for range [16K, 32K)
Since the range [16K, 32K) is not backed by preallocated extent, we
have to go COW.
- cow_file_range() failed for range [16K, 32K)
So cow_file_range() will do the clean up by clearing folio dirty,
unlock the folios.
Now the folios in range [16K, 32K) is unlocked.
- Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
Which is called with PAGE_START_WRITEBACK to start page writeback.
But folios can only be marked writeback when it's properly locked,
thus this triggered the VM_BUG_ON_FOLIO().
Furthermore there is another hidden but common bug that
run_delalloc_nocow() is not clearing the folio dirty flags in its error
handling path.
This is the common bug shared between run_delalloc_nocow() and
cow_file_range().
[FIX]
- Clear folio dirty for range [@start, @cur_offset)
Introduce a helper, cleanup_dirty_folios(), which
will find and lock the folio in the range, clear the dirty flag and
start/end the writeback, with the extra handling for the
@locked_folio.
- Introduce a helper to record the last failed COW range end
This is to trace which range we should skip, to avoid double
unlocking.
- Skip the failed COW range for the error handling
Cc: stable(a)vger.kernel.org
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
fs/btrfs/inode.c | 93 ++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 86 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 19c88b7d0363..bae8aceb3eae 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1961,6 +1961,48 @@ static int can_nocow_file_extent(struct btrfs_path *path,
return ret < 0 ? ret : can_nocow;
}
+static void cleanup_dirty_folios(struct btrfs_inode *inode,
+ struct folio *locked_folio,
+ u64 start, u64 end, int error)
+{
+ struct btrfs_fs_info *fs_info = inode->root->fs_info;
+ struct address_space *mapping = inode->vfs_inode.i_mapping;
+ pgoff_t start_index = start >> PAGE_SHIFT;
+ pgoff_t end_index = end >> PAGE_SHIFT;
+ u32 len;
+
+ ASSERT(end + 1 - start < U32_MAX);
+ ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
+ IS_ALIGNED(end + 1, fs_info->sectorsize));
+ len = end + 1 - start;
+
+ /*
+ * Handle the locked folio first.
+ * btrfs_folio_clamp_*() helpers can handle range out of the folio case.
+ */
+ btrfs_folio_clamp_clear_dirty(fs_info, locked_folio, start, len);
+ btrfs_folio_clamp_set_writeback(fs_info, locked_folio, start, len);
+ btrfs_folio_clamp_clear_writeback(fs_info, locked_folio, start, len);
+
+ for (pgoff_t index = start_index; index <= end_index; index++) {
+ struct folio *folio;
+
+ /* Already handled at the beginning. */
+ if (index == locked_folio->index)
+ continue;
+ folio = __filemap_get_folio(mapping, index, FGP_LOCK, GFP_NOFS);
+ /* Cache already dropped, no need to do any cleanup. */
+ if (IS_ERR(folio))
+ continue;
+ btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
+ btrfs_folio_clamp_set_writeback(fs_info, folio, start, len);
+ btrfs_folio_clamp_clear_writeback(fs_info, folio, start, len);
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+ mapping_set_error(mapping, error);
+}
+
/*
* when nowcow writeback call back. This checks for snapshots or COW copies
* of the extents that exist in the file, and COWs the file as required.
@@ -1976,6 +2018,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
struct btrfs_root *root = inode->root;
struct btrfs_path *path;
u64 cow_start = (u64)-1;
+ /*
+ * If not 0, represents the inclusive end of the last fallback_to_cow()
+ * range. Only for error handling.
+ */
+ u64 cow_end = 0;
u64 cur_offset = start;
int ret;
bool check_prev = true;
@@ -2136,6 +2183,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
found_key.offset - 1);
cow_start = (u64)-1;
if (ret) {
+ cow_end = found_key.offset - 1;
btrfs_dec_nocow_writers(nocow_bg);
goto error;
}
@@ -2209,11 +2257,12 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
cow_start = cur_offset;
if (cow_start != (u64)-1) {
- cur_offset = end;
ret = fallback_to_cow(inode, locked_folio, cow_start, end);
cow_start = (u64)-1;
- if (ret)
+ if (ret) {
+ cow_end = end;
goto error;
+ }
}
btrfs_free_path(path);
@@ -2221,12 +2270,42 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
error:
/*
- * If an error happened while a COW region is outstanding, cur_offset
- * needs to be reset to cow_start to ensure the COW region is unlocked
- * as well.
+ * There are several error cases:
+ *
+ * 1) Failed without falling back to COW
+ * start cur_start end
+ * |/////////////| |
+ *
+ * For range [start, cur_start) the folios are already unlocked (except
+ * @locked_folio), EXTENT_DELALLOC already removed.
+ * Only need to clear the dirty flag as they will never be submitted.
+ * Ordered extent and extent maps are handled by
+ * btrfs_mark_ordered_io_finished() inside run_delalloc_range().
+ *
+ * 2) Failed with error from fallback_to_cow()
+ * start cur_start cow_end end
+ * |/////////////|-----------| |
+ *
+ * For range [start, cur_start) it's the same as case 1).
+ * But for range [cur_start, cow_end), the folios have dirty flag
+ * cleared and unlocked, EXTENT_DEALLLOC cleared.
+ * There may or may not be any ordered extents/extent maps allocated.
+ *
+ * We should not call extent_clear_unlock_delalloc() on range [cur_start,
+ * cow_end), as the folios are already unlocked.
+ *
+ * So clear the folio dirty flags for [start, cur_offset) first.
*/
- if (cow_start != (u64)-1)
- cur_offset = cow_start;
+ if (cur_offset > start)
+ cleanup_dirty_folios(inode, locked_folio, start, cur_offset - 1, ret);
+
+ /*
+ * If an error happened while a COW region is outstanding, cur_offset
+ * needs to be reset to @cow_end + 1 to skip the COW range, as
+ * cow_file_range() will do the proper cleanup at error.
+ */
+ if (cow_end)
+ cur_offset = cow_end + 1;
/*
* We need to lock the extent here because we're clearing DELALLOC and
--
2.47.1
[BUG]
When testing with COW fixup marked as BUG_ON() (this is involved with the
new pin_user_pages*() change, which should not result new out-of-band
dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW
fixup path.
This BUG_ON() happens just after a failed btrfs_run_delalloc_range():
BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1444!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : extent_writepage_io+0x2d4/0x308 [btrfs]
lr : extent_writepage_io+0x2d4/0x308 [btrfs]
Call trace:
extent_writepage_io+0x2d4/0x308 [btrfs]
extent_writepage+0x218/0x330 [btrfs]
extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x180/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
That failure is mostly from cow_file_range(), where we can hit -ENOSPC.
Although the -ENOSPC is already a bug related to our space reservation
code, let's just focus on the error handling.
For example, we have the following dirty range [0, 64K) of an inode,
with 4K sector size and 4K page size:
0 16K 32K 48K 64K
|///////////////////////////////////////|
|#######################################|
Where |///| means page are still dirty, and |###| means the extent io
tree has EXTENT_DELALLOC flag.
- Enter extent_writepage() for page 0
- Enter btrfs_run_delalloc_range() for range [0, 64K)
- Enter cow_file_range() for range [0, 64K)
- Function btrfs_reserve_extent() only reserved one 16K extent
So we created extent map and ordered extent for range [0, 16K)
0 16K 32K 48K 64K
|////////|//////////////////////////////|
|<- OE ->|##############################|
And range [0, 16K) has its delalloc flag cleared.
But since we haven't yet submit any bio, involved 4 pages are still
dirty.
- Function btrfs_reserve_extent() return with -ENOSPC
Now we have to run error cleanup, which will clear all
EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
ranges:
0 16K 32K 48K 64K
|////////| |
| | |
Note that range [0, 16K) still has their pages dirty.
- Some time later, writeback are triggered again for the range [0, 16K)
since the page range still have dirty flags.
- btrfs_run_delalloc_range() will do nothing because there is no
EXTENT_DELALLOC flag.
- extent_writepage_io() find page 0 has no ordered flag
Which falls into the COW fixup path, triggering the BUG_ON().
Unfortunately this error handling bug dates back to the introduction of btrfs.
Thankfully with the abuse of cow fixup, at least it won't crash the
kernel.
[FIX]
Instead of immediately unlock the extent and folios, we keep the extent
and folios locked until either erroring out or the whole delalloc range
finished.
When the whole delalloc range finished without error, we just unlock the
whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
And those involved folios will be properly submitted, with their dirty
flags cleared during submission.
For the error path, it will be a little more complex:
- The range with ordered extent allocated (range (1))
We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
flags are cleaned up by
btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().
For folios we finish the IO (clear dirty, start writeback and
immediately finish the writeback) and unlock the folios.
- The range with reserved extent but no ordered extent (range(2))
- The range we never touched (range(3))
For both range (2) and range(3) the behavior is not changed.
Now even if cow_file_range() failed halfway with some successfully
reserved extents/ordered extents, we will keep all folios clean, so
there will be no future writeback triggered on them.
Cc: stable(a)vger.kernel.org
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
fs/btrfs/inode.c | 65 ++++++++++++++++++++++++------------------------
1 file changed, 32 insertions(+), 33 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5ba8d044757b..19c88b7d0363 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1364,6 +1364,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
alloc_hint = btrfs_get_extent_allocation_hint(inode, start, num_bytes);
+ /*
+ * We're not doing compressed IO, don't unlock the first page
+ * (which the caller expects to stay locked), don't clear any
+ * dirty bits and don't set any writeback bits
+ *
+ * Do set the Ordered (Private2) bit so we know this page was
+ * properly setup for writepage.
+ */
+ page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
+ page_ops |= PAGE_SET_ORDERED;
+
/*
* Relocation relies on the relocated extents to have exactly the same
* size as the original extents. Normally writeback for relocation data
@@ -1423,6 +1434,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
file_extent.offset = 0;
file_extent.compression = BTRFS_COMPRESS_NONE;
+ /*
+ * Locked range will be released either during error clean up or
+ * after the whole range is finished.
+ */
lock_extent(&inode->io_tree, start, start + cur_alloc_size - 1,
&cached);
@@ -1468,21 +1483,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
btrfs_dec_block_group_reservations(fs_info, ins.objectid);
- /*
- * We're not doing compressed IO, don't unlock the first page
- * (which the caller expects to stay locked), don't clear any
- * dirty bits and don't set any writeback bits
- *
- * Do set the Ordered flag so we know this page was
- * properly setup for writepage.
- */
- page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
- page_ops |= PAGE_SET_ORDERED;
-
- extent_clear_unlock_delalloc(inode, start, start + cur_alloc_size - 1,
- locked_folio, &cached,
- EXTENT_LOCKED | EXTENT_DELALLOC,
- page_ops);
if (num_bytes < cur_alloc_size)
num_bytes = 0;
else
@@ -1499,6 +1499,9 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
if (ret)
goto out_unlock;
}
+ extent_clear_unlock_delalloc(inode, orig_start, end, locked_folio, &cached,
+ EXTENT_LOCKED | EXTENT_DELALLOC,
+ page_ops);
done:
if (done_offset)
*done_offset = end;
@@ -1519,35 +1522,31 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
* We process each region below.
*/
- clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
- EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
- page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
-
/*
* For the range (1). We have already instantiated the ordered extents
* for this region. They are cleaned up by
* btrfs_cleanup_ordered_extents() in e.g,
- * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
- * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
- * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
- * function.
+ * btrfs_run_delalloc_range().
+ * EXTENT_DELALLOC_NEW | EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV
+ * are also handled by the cleanup function.
*
- * However, in case of @keep_locked, we still need to unlock the pages
- * (except @locked_folio) to ensure all the pages are unlocked.
+ * So here we only clear EXTENT_LOCKED and EXTENT_DELALLOC flag,
+ * and finish the writeback of the involved folios, which will be
+ * never submitted.
*/
- if (keep_locked && orig_start < start) {
+ if (orig_start < start) {
+ clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC;
+ page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
+
if (!locked_folio)
mapping_set_error(inode->vfs_inode.i_mapping, ret);
extent_clear_unlock_delalloc(inode, orig_start, start - 1,
- locked_folio, NULL, 0, page_ops);
+ locked_folio, NULL, clear_bits, page_ops);
}
- /*
- * At this point we're unlocked, we want to make sure we're only
- * clearing these flags under the extent lock, so lock the rest of the
- * range and clear everything up.
- */
- lock_extent(&inode->io_tree, start, end, NULL);
+ clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+ EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
+ page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
/*
* For the range (2). If we reserved an extent for our delalloc range
--
2.47.1
[BUG]
If submit_one_sector() failed inside extent_writepage_io() for sector
size < page size cases (e.g. 4K sector size and 64K page size), then
we can hit double ordered extent accounting error.
This should be very rare, as submit_one_sector() only fails when we
failed to grab the extent map, and such extent map should exist inside
the memory and have been pinned.
[CAUSE]
For example we have the following folio layout:
0 4K 32K 48K 60K 64K
|//| |//////| |///|
Where |///| is the dirty range we need to writeback. The 3 different
dirty ranges are submitted for regular COW.
Now we hit the following sequence:
- submit_one_sector() returned 0 for [0, 4K)
- submit_one_sector() returned 0 for [32K, 48K)
- submit_one_sector() returned error for [60K, 64K)
- btrfs_mark_ordered_io_finished() called for the whole folio
This will mark the following ranges as finished:
* [0, 4K)
* [32K, 48K)
Both ranges have their IO already submitted, this cleanup will
lead to double accounting.
* [60K, 64K)
That's the correct cleanup.
The only good news is, this error is only theoretical, as the target
extent map is always pinned, thus we should directly grab it from
memory, other than reading it from the disk.
[FIX]
Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
range, which can touch ranges we should not touch, instead
move the error handling inside extent_writepage_io().
So that we can cleanup exact sectors that are ought to be submitted but
failed.
This provide much more accurate cleanup, avoiding the double accounting.
Cc: stable(a)vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
fs/btrfs/extent_io.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 417c710c55ca..b6a4f1765b4c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1418,6 +1418,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned long range_bitmap = 0;
bool submitted_io = false;
+ bool error = false;
const u64 folio_start = folio_pos(folio);
u64 cur;
int bit;
@@ -1460,11 +1461,21 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
break;
}
ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
- if (ret < 0)
- goto out;
+ if (unlikely(ret < 0)) {
+ submit_one_bio(bio_ctrl);
+ /*
+ * Failed to grab the extent map which should be very rare.
+ * Since there is no bio submitted to finish the ordered
+ * extent, we have to manually finish this sector.
+ */
+ btrfs_mark_ordered_io_finished(inode, folio, cur,
+ fs_info->sectorsize, false);
+ error = true;
+ continue;
+ }
submitted_io = true;
}
-out:
+
/*
* If we didn't submitted any sector (>= i_size), folio dirty get
* cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
@@ -1472,8 +1483,11 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
*
* Here we set writeback and clear for the range. If the full folio
* is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
+ *
+ * If we hit any error, the corresponding sector will still be dirty
+ * thus no need to clear PAGECACHE_TAG_DIRTY.
*/
- if (!submitted_io) {
+ if (!submitted_io && !error) {
btrfs_folio_set_writeback(fs_info, folio, start, len);
btrfs_folio_clear_writeback(fs_info, folio, start, len);
}
@@ -1493,7 +1507,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
{
struct inode *inode = folio->mapping->host;
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
- const u64 page_start = folio_pos(folio);
int ret;
size_t pg_offset;
loff_t i_size = i_size_read(inode);
@@ -1536,10 +1549,6 @@ static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl
bio_ctrl->wbc->nr_to_write--;
- if (ret)
- btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
- page_start, PAGE_SIZE, !ret);
-
done:
if (ret < 0)
mapping_set_error(folio->mapping, ret);
@@ -2319,11 +2328,8 @@ void extent_write_locked_range(struct inode *inode, const struct folio *locked_f
if (ret == 1)
goto next_page;
- if (ret) {
- btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
- cur, cur_len, !ret);
+ if (ret)
mapping_set_error(mapping, ret);
- }
btrfs_folio_end_lock(fs_info, folio, cur, cur_len);
if (ret < 0)
found_error = true;
--
2.47.1
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec
queue file lock usage."). While it's desired to have the mutex to
protect only the reference to the exec queue, getting and dropping each
mutex and then later getting the GPU timestamp, doesn't produce a
correct result: it introduces multiple opportunities for the task to be
scheduled out and thus wrecking havoc the deltas reported to userspace.
Also, to better correlate the timestamp from the exec queues with the
GPU, disable preemption so they can be updated without allowing the task
to be scheduled out. We leave interrupts enabled as that shouldn't be
enough disturbance for the deltas to matter to userspace.
Test scenario:
* IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`
* Platform: LNL, where CI occasionally reports failures
* `stress -c $(nproc)` running in parallel to disturb the
system
This brings a first failure from "after ~150 executions" to "never
occurs after 1000 attempts".
Cc: stable(a)vger.kernel.org # v6.11+
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512
Signed-off-by: Lucas De Marchi <lucas.demarchi(a)intel.com>
---
drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index 298a587da7f17..e307b4d6bab5a 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
/* Accumulate all the exec queues from this client */
mutex_lock(&xef->exec_queue.lock);
- xa_for_each(&xef->exec_queue.xa, i, q) {
- xe_exec_queue_get(q);
- mutex_unlock(&xef->exec_queue.lock);
+ preempt_disable();
+ xa_for_each(&xef->exec_queue.xa, i, q)
xe_exec_queue_update_run_ticks(q);
- mutex_lock(&xef->exec_queue.lock);
- xe_exec_queue_put(q);
- }
+ preempt_enable();
mutex_unlock(&xef->exec_queue.lock);
gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
--
2.47.0
Avoid a NULL pointer dereference if the memslot table changes between the
exit and the call to kvmppc_e500_shadow_map().
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
---
arch/powerpc/kvm/e500_mmu_host.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index e5a145b578a4..732335444d68 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -349,6 +349,11 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
* pointer through from the first lookup.
*/
slot = gfn_to_memslot(vcpu_e500->vcpu.kvm, gfn);
+ if (!slot) {
+ ret = -EAGAIN;
+ goto out;
+ }
+
hva = gfn_to_hva_memslot(slot, gfn);
if (tlbsel == 1) {
--
2.47.1
This series collects the various SME related fixes that were previously
posted separately. These should address all the issues I am aware of so
a patch which reenables the SME configuration option is also included.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Pull simplification of the signal restore code after the SME
reenablement, it's not a fix but there's some code overlap.
- Comment updates.
- Link to v1: https://lore.kernel.org/r/20241203-arm64-sme-reenable-v1-0-d853479d1b77@ker…
---
Mark Brown (6):
arm64/sme: Flush foreign register state in do_sme_acc()
arm64/fp: Don't corrupt FPMR when streaming mode changes
arm64/ptrace: Zero FPMR on streaming mode entry/exit
arm64/signal: Avoid corruption of SME state when entering signal handler
arm64/sme: Reenable SME
arm64/signal: Consistently invalidate the in register FP state in restore
arch/arm64/Kconfig | 1 -
arch/arm64/include/asm/fpsimd.h | 1 +
arch/arm64/kernel/fpsimd.c | 57 ++++++++++++++++++++++----
arch/arm64/kernel/ptrace.c | 12 +++++-
arch/arm64/kernel/signal.c | 89 +++++++++++------------------------------
5 files changed, 84 insertions(+), 76 deletions(-)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241202-arm64-sme-reenable-98e64c161a8e
Best regards,
--
Mark Brown <broonie(a)kernel.org>
As pointed out by Al Viro and Eric Dumazet in [1], using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns as it is usually done. This could cause
unexpected issues when other operations are done on the wrong netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' or 'pernet' structure can be obtained from the table->data
using container_of().
Note that table->data could also be used directly in more places, but
that would increase the size of this fix to replace all accesses via
'net'. Probably best to avoid that for fixes.
Patches 2-9 remove access of net via current->nsproxy in sysfs handlers
in MPTCP, SCTP and RDS. There are multiple patches doing almost the same
thing, but the reason is to ease the backports.
Patch 1 is not directly linked to this, but it is a small fix for MPTCP
available_schedulers sysctl knob to explicitly mark it as read-only.
Please note that this series does not address Al's comment [2]. In SCTP,
some sysctl knobs set other sysfs-exposed variables for the min/max: two
processes could then write two linked values at the same time, resulting
in new values being outside the new boundaries. It would be great if
SCTP developers can look at this problem.
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Link: https://lore.kernel.org/netdev/20250105211158.GL1977892@ZenIV/ [2]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
Matthieu Baerts (NGI0) (9):
mptcp: sysctl: avail sched: remove write access
mptcp: sysctl: sched: avoid using current->nsproxy
mptcp: sysctl: blackhole timeout: avoid using current->nsproxy
sctp: sysctl: cookie_hmac_alg: avoid using current->nsproxy
sctp: sysctl: rto_min/max: avoid using current->nsproxy
sctp: sysctl: auth_enable: avoid using current->nsproxy
sctp: sysctl: udp_port: avoid using current->nsproxy
sctp: sysctl: plpmtud_probe_interval: avoid using current->nsproxy
rds: sysctl: rds_tcp_{rcv,snd}buf: avoid using current->nsproxy
net/mptcp/ctrl.c | 17 +++++++++--------
net/rds/tcp.c | 39 ++++++++++++++++++++++++++++++++-------
net/sctp/sysctl.c | 14 ++++++++------
3 files changed, 49 insertions(+), 21 deletions(-)
---
base-commit: db78475ba0d3c66d430f7ded2388cc041078a542
change-id: 20250108-net-sysctl-current-nsproxy-672ae21a873f
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>