[BUG]
When testing with COW fixup marked as BUG_ON() (this is involved with the
new pin_user_pages*() change, which should not result new out-of-band
dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW
fixup path.
This BUG_ON() happens just after a failed btrfs_run_delalloc_range():
BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1444!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : extent_writepage_io+0x2d4/0x308 [btrfs]
lr : extent_writepage_io+0x2d4/0x308 [btrfs]
Call trace:
extent_writepage_io+0x2d4/0x308 [btrfs]
extent_writepage+0x218/0x330 [btrfs]
extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x180/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
That failure is mostly from cow_file_range(), where we can hit -ENOSPC.
Although the -ENOSPC is already a bug related to our space reservation
code, let's just focus on the error handling.
For example, we have the following dirty range [0, 64K) of an inode,
with 4K sector size and 4K page size:
0 16K 32K 48K 64K
|///////////////////////////////////////|
|#######################################|
Where |///| means page are still dirty, and |###| means the extent io
tree has EXTENT_DELALLOC flag.
- Enter extent_writepage() for page 0
- Enter btrfs_run_delalloc_range() for range [0, 64K)
- Enter cow_file_range() for range [0, 64K)
- Function btrfs_reserve_extent() only reserved one 16K extent
So we created extent map and ordered extent for range [0, 16K)
0 16K 32K 48K 64K
|////////|//////////////////////////////|
|<- OE ->|##############################|
And range [0, 16K) has its delalloc flag cleared.
But since we haven't yet submit any bio, involved 4 pages are still
dirty.
- Function btrfs_reserve_extent() return with -ENOSPC
Now we have to run error cleanup, which will clear all
EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
ranges:
0 16K 32K 48K 64K
|////////| |
| | |
Note that range [0, 16K) still has their pages dirty.
- Some time later, writeback are triggered again for the range [0, 16K)
since the page range still have dirty flags.
- btrfs_run_delalloc_range() will do nothing because there is no
EXTENT_DELALLOC flag.
- extent_writepage_io() find page 0 has no ordered flag
Which falls into the COW fixup path, triggering the BUG_ON().
Unfortunately this error handling bug dates back to the introduction of btrfs.
Thankfully with the abuse of cow fixup, at least it won't crash the
kernel.
[FIX]
Instead of immediately unlock the extent and folios, we keep the extent
and folios locked until either erroring out or the whole delalloc range
finished.
When the whole delalloc range finished without error, we just unlock the
whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
And those involved folios will be properly submitted, with their dirty
flags cleared during submission.
For the error path, it will be a little more complex:
- The range with ordered extent allocated (range (1))
We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
flags are cleaned up by
btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().
For folios we finish the IO (clear dirty, start writeback and
immediately finish the writeback) and unlock the folios.
- The range with reserved extent but no ordered extent (range(2))
- The range we never touched (range(3))
For both range (2) and range(3) the behavior is not changed.
Now even if cow_file_range() failed halfway with some successfully
reserved extents/ordered extents, we will keep all folios clean, so
there will be no future writeback triggered on them.
Cc: stable(a)vger.kernel.org
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
Changelog:
v2:
- Fix the error handling for range (1)
Where the patch is still using the old "0", which will not unlock the
extent range.
---
fs/btrfs/inode.c | 65 ++++++++++++++++++++++++------------------------
1 file changed, 32 insertions(+), 33 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9267861f8ab0..9517fb2df649 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1372,6 +1372,17 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
alloc_hint = btrfs_get_extent_allocation_hint(inode, start, num_bytes);
+ /*
+ * We're not doing compressed IO, don't unlock the first page
+ * (which the caller expects to stay locked), don't clear any
+ * dirty bits and don't set any writeback bits
+ *
+ * Do set the Ordered (Private2) bit so we know this page was
+ * properly setup for writepage.
+ */
+ page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
+ page_ops |= PAGE_SET_ORDERED;
+
/*
* Relocation relies on the relocated extents to have exactly the same
* size as the original extents. Normally writeback for relocation data
@@ -1431,6 +1442,10 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
file_extent.offset = 0;
file_extent.compression = BTRFS_COMPRESS_NONE;
+ /*
+ * Locked range will be released either during error clean up or
+ * after the whole range is finished.
+ */
lock_extent(&inode->io_tree, start, start + cur_alloc_size - 1,
&cached);
@@ -1476,21 +1491,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
btrfs_dec_block_group_reservations(fs_info, ins.objectid);
- /*
- * We're not doing compressed IO, don't unlock the first page
- * (which the caller expects to stay locked), don't clear any
- * dirty bits and don't set any writeback bits
- *
- * Do set the Ordered (Private2) bit so we know this page was
- * properly setup for writepage.
- */
- page_ops = (keep_locked ? 0 : PAGE_UNLOCK);
- page_ops |= PAGE_SET_ORDERED;
-
- extent_clear_unlock_delalloc(inode, start, start + cur_alloc_size - 1,
- locked_folio, &cached,
- EXTENT_LOCKED | EXTENT_DELALLOC,
- page_ops);
if (num_bytes < cur_alloc_size)
num_bytes = 0;
else
@@ -1507,6 +1507,9 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
if (ret)
goto out_unlock;
}
+ extent_clear_unlock_delalloc(inode, orig_start, end, locked_folio, &cached,
+ EXTENT_LOCKED | EXTENT_DELALLOC,
+ page_ops);
done:
if (done_offset)
*done_offset = end;
@@ -1527,35 +1530,31 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
* We process each region below.
*/
- clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
- EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
- page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
-
/*
* For the range (1). We have already instantiated the ordered extents
* for this region. They are cleaned up by
* btrfs_cleanup_ordered_extents() in e.g,
- * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
- * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
- * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
- * function.
+ * btrfs_run_delalloc_range().
+ * EXTENT_DELALLOC_NEW | EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV
+ * are also handled by the cleanup function.
*
- * However, in case of @keep_locked, we still need to unlock the pages
- * (except @locked_folio) to ensure all the pages are unlocked.
+ * So here we only clear EXTENT_LOCKED and EXTENT_DELALLOC flag,
+ * and finish the writeback of the involved folios, which will be
+ * never submitted.
*/
- if (keep_locked && orig_start < start) {
+ if (orig_start < start) {
+ clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC;
+ page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
+
if (!locked_folio)
mapping_set_error(inode->vfs_inode.i_mapping, ret);
extent_clear_unlock_delalloc(inode, orig_start, start - 1,
- locked_folio, NULL, 0, page_ops);
+ locked_folio, NULL, clear_bits, page_ops);
}
- /*
- * At this point we're unlocked, we want to make sure we're only
- * clearing these flags under the extent lock, so lock the rest of the
- * range and clear everything up.
- */
- lock_extent(&inode->io_tree, start, end, NULL);
+ clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
+ EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
+ page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
/*
* For the range (2). If we reserved an extent for our delalloc range
--
2.47.1
On Wed, Dec 04, 2024 at 05:29:26PM +0800, wzs wrote:
> Hello,
> when fuzzing the Linux kernel 6.7.0,
> the following crash was triggered.
>
> kernel config : https://pastebin.com/3JeQFdUr
> console output : https://pastebin.com/9ADtBQtP
>
> Basically, we use gadget module to simulate the connection and interaction
> process of a USB device
> (device type code : 0003, vendor id : 046D, product id : C312, serial
> number : 27B4, with function : input event).
>
> It seems to be caused by a mismatch between the uevent's environmental
> limit and the buffer size used to receive the uevent, which triggers such
> kernel warning.
>
> The crash report is as follow:
> 、、、
> [203835.102225] input: wingfuz Keyboard as
> /devices/platform/dummy_hcd.0/usb3/3-1/3-1:1.0/0003:046D:C312.27B4/input/input5893
> [203835.155527] ------------[ cut here ]------------
> [203835.155533] add_uevent_var: buffer size too small
> [203835.162092] WARNING: CPU: 11 PID: 57434 at lib/kobject_uevent.c:671
> add_uevent_var+0x2fe/0x390
I think this is already fixed in newer kernel versions. 6.7.0 is very
old and obsolete. Can you test this on 6.12.1?
thanks,
greg k-h
This fixes a couple of different problems, that can cause RTC (alarm)
irqs to be missing when generating UIE interrupts.
The first commit fixes a long-standing problem, which has been
documented in a comment since 2010. This fixes a race that could cause
UIE irqs to stop being generated, which was easily reproduced by
timing the use of RTC_UIE_ON ioctl with the seconds tick in the RTC.
The last commit ensures that RTC (alarm) irqs are enabled whenever
RTC_UIE_ON ioctl is used.
The driver specific commits avoids kernel warnings about unbalanced
enable_irq/disable_irq, which gets triggered on first RTC_UIE_ON with
the last commit. Before this series, the same warning should be seen
on initial RTC_AIE_ON with those drivers.
Signed-off-by: Esben Haabendal <esben(a)geanix.com>
---
Esben Haabendal (6):
rtc: interface: Fix long-standing race when setting alarm
rtc: isl12022: Fix initial enable_irq/disable_irq balance
rtc: cpcap: Fix initial enable_irq/disable_irq balance
rtc: st-lpc: Fix initial enable_irq/disable_irq balance
rtc: tps6586x: Fix initial enable_irq/disable_irq balance
rtc: interface: Ensure alarm irq is enabled when UIE is enabled
drivers/rtc/interface.c | 27 +++++++++++++++++++++++++++
drivers/rtc/rtc-cpcap.c | 1 +
drivers/rtc/rtc-isl12022.c | 1 +
drivers/rtc/rtc-st-lpc.c | 1 +
drivers/rtc/rtc-tps6586x.c | 1 +
5 files changed, 31 insertions(+)
---
base-commit: 40384c840ea1944d7c5a392e8975ed088ecf0b37
change-id: 20241203-rtc-uie-irq-fixes-f2838782d0f8
Best regards,
--
Esben Haabendal <esben(a)geanix.com>
The current requested response version(V1) for MANA_QUERY_GF_STAT query
results in STATISTICS_FLAGS_TX_ERRORS_GDMA_ERROR value being set to
0 always.
In order to get the correct value for this counter we request the response
version to be V2.
Cc: stable(a)vger.kernel.org
Fixes: e1df5202e879 ("net :mana :Add remaining GDMA stats for MANA to ethtool")
Signed-off-by: Shradha Gupta <shradhagupta(a)linux.microsoft.com>
---
drivers/net/ethernet/microsoft/mana/mana_en.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 57ac732e7707..f73848a4feb3 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2536,6 +2536,7 @@ void mana_query_gf_stats(struct mana_port_context *apc)
mana_gd_init_req_hdr(&req.hdr, MANA_QUERY_GF_STAT,
sizeof(req), sizeof(resp));
+ req.hdr.resp.msg_version = GDMA_MESSAGE_V2;
req.req_stats = STATISTICS_FLAGS_RX_DISCARDS_NO_WQE |
STATISTICS_FLAGS_RX_ERRORS_VPORT_DISABLED |
STATISTICS_FLAGS_HC_RX_BYTES |
--
2.43.0
I think this patch should also be backported to the v6.6 LTS tree.
Since it should recolonize as Fixes: 8ee0b41898 ("riscv: signal:
Add sigcontext save/restore for vector") and that commit first
appears since v6.5-rc1 and this patch land to master branch since
v6.9-rc3
Thanks,
Yangyu Chen
On 4/3/24 15:26, Björn Töpel wrote:
> From: Björn Töpel <bjorn(a)rivosinc.com>
> The RISC-V Vector specification states in "Appendix D: Calling
> Convention for Vector State" [1] that "Executing a system call causes
> all caller-saved vector registers (v0-v31, vl, vtype) and vstart to
> become unspecified.". In the RISC-V kernel this is called "discarding
> the vstate".
> Returning from a signal handler via the rt_sigreturn() syscall, vector
> discard is also performed. However, this is not an issue since the
> vector state should be restored from the sigcontext, and therefore not
> care about the vector discard.
> The "live state" is the actual vector register in the running context,
> and the "vstate" is the vector state of the task. A dirty live state,
> means that the vstate and live state are not in synch.
> When vectorized user_from_copy() was introduced, an bug sneaked in at
> the restoration code, related to the discard of the live state.
> An example when this go wrong:
> 1. A userland application is executing vector code
> 2. The application receives a signal, and the signal handler is
> entered.
> 3. The application returns from the signal handler, using the
> rt_sigreturn() syscall.
> 4. The live vector state is discarded upon entering the
> rt_sigreturn(), and the live state is marked as "dirty", indicating
> that the live state need to be synchronized with the current
> vstate.
> 5. rt_sigreturn() restores the vstate, except the Vector registers,
> from the sigcontext
> 6. rt_sigreturn() restores the Vector registers, from the sigcontext,
> and now the vectorized user_from_copy() is used. The dirty live
> state from the discard is saved to the vstate, making the vstate
> corrupt.
> 7. rt_sigreturn() returns to the application, which crashes due to
> corrupted vstate.
> Note that the vectorized user_from_copy() is invoked depending on the
> value of CONFIG_RISCV_ISA_V_UCOPY_THRESHOLD. Default is 768, which
> means that vlen has to be larger than 128b for this bug to trigger.
> The fix is simply to mark the live state as non-dirty/clean prior
> performing the vstate restore.
> Link: https://github.com/riscv/riscv-isa-manual/releases/download/riscv-isa-relea… # [1]
> Reported-by: Charlie Jenkins <charlie(a)rivosinc.com>
> Reported-by: Vineet Gupta <vgupta(a)kernel.org>
> Fixes: c2a658d41924 ("riscv: lib: vectorize copy_to_user/copy_from_user")
> Signed-off-by: Björn Töpel <bjorn(a)rivosinc.com>
> ---
> arch/riscv/kernel/signal.c | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
> diff --git a/arch/riscv/kernel/signal.c b/arch/riscv/kernel/signal.c
> index 501e66debf69..5a2edd7f027e 100644
> --- a/arch/riscv/kernel/signal.c
> +++ b/arch/riscv/kernel/signal.c
> @@ -119,6 +119,13 @@ static long __restore_v_state(struct pt_regs *regs, void __user *sc_vec)
> struct __sc_riscv_v_state __user *state = sc_vec;
> void __user *datap;
> + /*
> + * Mark the vstate as clean prior performing the actual copy,
> + * to avoid getting the vstate incorrectly clobbered by the
> + * discarded vector state.
> + */
> + riscv_v_vstate_set_restore(current, regs);
> +
> /* Copy everything of __sc_riscv_v_state except datap. */
> err = __copy_from_user(¤t->thread.vstate, &state->v_state,
> offsetof(struct __riscv_v_ext_state, datap));
> @@ -133,13 +140,7 @@ static long __restore_v_state(struct pt_regs *regs, void __user *sc_vec)
> * Copy the whole vector content from user space datap. Use
> * copy_from_user to prevent information leak.
> */
> - err = copy_from_user(current->thread.vstate.datap, datap, riscv_v_vsize);
> - if (unlikely(err))
> - return err;
> -
> - riscv_v_vstate_set_restore(current, regs);
> -
> - return err;
> + return copy_from_user(current->thread.vstate.datap, datap, riscv_v_vsize);
> }
> #else
> #define save_v_state(task, regs) (0)
> base-commit: 7115ff4a8bfed3b9294bad2e111744e6abeadf1a
With the new __counted_by annocation in cfg80211_scan_request struct,
the "n_channels" struct member must be set before accessing the
"channels" array. Failing to do so will trigger a runtime warning
when enabling CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE.
Fixes: e3eac9f32ec0 ("wifi: cfg80211: Annotate struct cfg80211_scan_request with __counted_by")
Signed-off-by: Haoyu Li <lihaoyu499(a)gmail.com>
---
net/wireless/sme.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/wireless/sme.c b/net/wireless/sme.c
index 431da30817a6..268171600087 100644
--- a/net/wireless/sme.c
+++ b/net/wireless/sme.c
@@ -83,6 +83,7 @@ static int cfg80211_conn_scan(struct wireless_dev *wdev)
if (!request)
return -ENOMEM;
+ request->n_channels = n_channels;
if (wdev->conn->params.channel) {
enum nl80211_band band = wdev->conn->params.channel->band;
struct ieee80211_supported_band *sband =
--
2.34.1
Make napi_hash_lock IRQ safe. It is used during the control path, and is
taken and released in napi_hash_add and napi_hash_del, which will
typically be called by calls to napi_enable and napi_disable.
This change avoids a deadlock in pcnet32 (and other any other drivers
which follow the same pattern):
CPU 0:
pcnet32_open
spin_lock_irqsave(&lp->lock, ...)
napi_enable
napi_hash_add <- before this executes, CPU 1 proceeds
spin_lock(napi_hash_lock)
[...]
spin_unlock_irqrestore(&lp->lock, flags);
CPU 1:
pcnet32_close
napi_disable
napi_hash_del
spin_lock(napi_hash_lock)
< INTERRUPT >
pcnet32_interrupt
spin_lock(lp->lock) <- DEADLOCK
Changing the napi_hash_lock to be IRQ safe prevents the IRQ from firing
on CPU 1 until napi_hash_lock is released, preventing the deadlock.
Cc: stable(a)vger.kernel.org
Fixes: 86e25f40aa1e ("net: napi: Add napi_config")
Reported-by: Guenter Roeck <linux(a)roeck-us.net>
Closes: https://lore.kernel.org/netdev/85dd4590-ea6b-427d-876a-1d8559c7ad82@roeck-u…
Suggested-by: Jakub Kicinski <kuba(a)kernel.org>
Signed-off-by: Joe Damato <jdamato(a)fastly.com>
---
net/core/dev.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 13d00fc10f55..45a8c3dd4a64 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6557,18 +6557,22 @@ static void __napi_hash_add_with_id(struct napi_struct *napi,
static void napi_hash_add_with_id(struct napi_struct *napi,
unsigned int napi_id)
{
- spin_lock(&napi_hash_lock);
+ unsigned long flags;
+
+ spin_lock_irqsave(&napi_hash_lock, flags);
WARN_ON_ONCE(napi_by_id(napi_id));
__napi_hash_add_with_id(napi, napi_id);
- spin_unlock(&napi_hash_lock);
+ spin_unlock_irqrestore(&napi_hash_lock, flags);
}
static void napi_hash_add(struct napi_struct *napi)
{
+ unsigned long flags;
+
if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
return;
- spin_lock(&napi_hash_lock);
+ spin_lock_irqsave(&napi_hash_lock, flags);
/* 0..NR_CPUS range is reserved for sender_cpu use */
do {
@@ -6578,7 +6582,7 @@ static void napi_hash_add(struct napi_struct *napi)
__napi_hash_add_with_id(napi, napi_gen_id);
- spin_unlock(&napi_hash_lock);
+ spin_unlock_irqrestore(&napi_hash_lock, flags);
}
/* Warning : caller is responsible to make sure rcu grace period
@@ -6586,11 +6590,13 @@ static void napi_hash_add(struct napi_struct *napi)
*/
static void napi_hash_del(struct napi_struct *napi)
{
- spin_lock(&napi_hash_lock);
+ unsigned long flags;
+
+ spin_lock_irqsave(&napi_hash_lock, flags);
hlist_del_init_rcu(&napi->napi_hash_node);
- spin_unlock(&napi_hash_lock);
+ spin_unlock_irqrestore(&napi_hash_lock, flags);
}
static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
--
2.25.1