[BUG]
During development of a minor feature (making sure all btrfs_bio::end_io()
callbacks are called in task context), I noticed a crash in generic/388, where
metadata writes triggered new work items after btrfs_stop_all_workers().
It turns out that it can even happen without any code modification: just
using RAID5 for metadata with the same workload as generic/388 is going
to trigger the use-after-free.
[CAUSE]
If btrfs hits an error, the fs is marked as being in an error state and
no new transaction is allowed, thus metadata is in a frozen state.
But there are some metadata modifications before that error, and they are
still in the btree inode page cache.
Since there will be no real transaction commit, all those dirty folios
are just kept as is in the page cache, and they can not be invalidated
by the invalidate_inode_pages2() call inside close_ctree(), because they
are dirty.
And finally after btrfs_stop_all_workers(), we call iput() on the btree
inode, which triggers writeback of those dirty metadata folios.
And if the fs is using RAID56 metadata, this will trigger RMW and queue
new work into rmw_workers, which is already stopped, causing a warning
from queue_work() and a use-after-free.
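For reference, the ordering inside close_ctree() that makes this possible
looks roughly like the following (a simplified sketch based on the
description above, not the exact upstream code):

/* Simplified sketch of the relevant close_ctree() ordering. */
void close_ctree_sketch(struct btrfs_fs_info *fs_info)
{
        /*
         * Dirty btree folios survive this call precisely because they
         * are dirty, so they stay in the page cache.
         */
        invalidate_inode_pages2(fs_info->btree_inode->i_mapping);

        /* rmw_workers and the other workqueues are destroyed here. */
        btrfs_stop_all_workers(fs_info);

        /*
         * The final iput() writes back the surviving dirty metadata;
         * with RAID56 metadata this ends up calling queue_work() on the
         * already freed rmw_workers -> warning + use-after-free.
         */
        iput(fs_info->btree_inode);
}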
[FIX]
Add special handling to write_one_eb(): if the fs is already in
an error state, immediately mark the bbio as failed instead of really
submitting it to the device.
This means for test cases like generic/388, at iput() those dirty folios
will just be discarded without triggering any IO.
CC: stable(a)vger.kernel.org
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
fs/btrfs/disk-io.c | 12 ++++++++++++
fs/btrfs/extent_io.c | 11 +++++++++++
2 files changed, 23 insertions(+)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0aa7e5d1b05f..8b0fc2df85f1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4402,11 +4402,23 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
btrfs_put_block_group_cache(fs_info);
+ /*
+ * If the fs is already in a transaction aborted state, also trigger
+ * writeback of all dirty metadata folios.
+ * Those folios will not reach disk but be discarded directly.
+ * This is to make sure there are no dirty folios before iput(), or
+ * iput() will trigger writeback again, and may even cause extra work
+ * queued into the workqueues.
+ */
+ if (unlikely(BTRFS_FS_ERROR(fs_info)))
+ filemap_write_and_wait(fs_info->btree_inode->i_mapping);
+
/*
* we must make sure there is not any read request to
* submit after we stopping all workers.
*/
invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+
btrfs_stop_all_workers(fs_info);
/* We shouldn't have any transaction open at this point */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 870584dde575..a8a53409bb3f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2246,6 +2246,17 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
wbc_account_cgroup_owner(wbc, folio, range_len);
folio_unlock(folio);
}
+ /*
+ * If the fs is already in an error state, do not submit any writeback
+ * but immediately finish it.
+ * This is to avoid iput() triggering dirty folio writeback for
+ * transaction aborted fses, which can queue extra work into
+ * already stopped workqueues.
+ */
+ if (unlikely(BTRFS_FS_ERROR(fs_info))) {
+ btrfs_bio_end_io(bbio, errno_to_blk_status(-EROFS));
+ return;
+ }
btrfs_submit_bbio(bbio, 0);
}
--
2.51.0
[BUG]
During development of a minor feature (making sure all btrfs_bio::end_io()
callbacks are called in task context), I noticed a crash in generic/388, where
metadata writes triggered new work items after btrfs_stop_all_workers().
It turns out that it can even happen without any code modification: just
using RAID5 for metadata with the same workload as generic/388 is going
to trigger the use-after-free.
[CAUSE]
If btrfs hits an error, the fs is marked as being in an error state and
no new transaction is allowed, thus metadata is in a frozen state.
But there are some metadata modifications before that error, and they are
still in the btree inode page cache.
Since there will be no real transaction commit, all those dirty folios
are just kept as is in the page cache, and they can not be invalidated
by the invalidate_inode_pages2() call inside close_ctree(), because they
are dirty.
And finally after btrfs_stop_all_workers(), we call iput() on the btree
inode, which triggers writeback of those dirty metadata folios.
And if the fs is using RAID56 metadata, this will trigger RMW and queue
new work into rmw_workers, which is already stopped, causing a warning
from queue_work() and a use-after-free.
[FIX]
Add special handling to write_one_eb(): if the fs is already in
an error state, immediately mark the bbio as failed instead of really
submitting it.
Then during close_ctree(), iput() will just discard all those dirty
tree blocks without really writing them back, thus no new jobs are
queued into the already stopped-and-freed workqueues.
The discard in write_one_eb() also acts as an extra safety net.
E.g. if the transaction abort is triggered by some extent/free space
tree corruption, then since the extent/free space tree is already
corrupted, some tree blocks may have been allocated where they
shouldn't be (overwriting existing tree blocks). In that case writing
them back would corrupt the fs even further.
CC: stable(a)vger.kernel.org #6.6
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
---
Changelog
v2:
- Various grammar and newline fixes
- A shorter title line
- Enhance the [FIX] part, explain the full fix
- Limit the backport for 6.6
The v6.1 code base is very different from the current one, thus
v6.6 is the limit for backporting.
- Explain more why discarding bios at write_one_eb() is safer
- Remove the extra flushing part inside close_ctree()
There is no difference between flushing the dirty folios manually or
via iput(); the dirty folios are discarded anyway and no new jobs
will be created.
---
fs/btrfs/extent_io.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 870584dde575..8f6b8baba003 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2246,6 +2246,14 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
wbc_account_cgroup_owner(wbc, folio, range_len);
folio_unlock(folio);
}
+ /*
+ * If the fs is already in an error state, do not submit any writeback
+ * but immediately finish it.
+ */
+ if (unlikely(BTRFS_FS_ERROR(fs_info))) {
+ btrfs_bio_end_io(bbio, errno_to_blk_status(-EROFS));
+ return;
+ }
btrfs_submit_bbio(bbio, 0);
}
--
2.51.0
An invalid pointer dereference bug was reported on an arm64 CPU; it has
not yet been seen on x86. A partial oops looks like:
Call trace:
update_cfs_rq_h_load+0x80/0xb0
wake_affine+0x158/0x168
select_task_rq_fair+0x364/0x3a8
try_to_wake_up+0x154/0x648
wake_up_q+0x68/0xd0
futex_wake_op+0x280/0x4c8
do_futex+0x198/0x1c0
__arm64_sys_futex+0x11c/0x198
Link: https://lore.kernel.org/all/20251013071820.1531295-1-CruzZhao@linux.alibaba…
We found that the task_group corresponding to the problematic se
is not in the parent task_group’s children list, indicating that
h_load_next points to an invalid address. Consider the following
cgroup and task hierarchy:
      A
     / \
    /   \
   B     E
  / \    |
 /   \   t2
C     D
|     |
t0    t1
Here follows a timing sequence that may be responsible for triggering
the problem:
CPU X                   CPU Y                   CPU Z
wakeup t0
set list A->B->C
traverse A->B->C
t0 exits
destroy C
                        wakeup t2
                        set list A->E           wakeup t1
                                                set list A->B->D
                        traverse A->B->C
                        panic
CPU Z sets the ->h_load_next list to A->B->D, but due to arm64's weaker
memory ordering, CPU Y may observe A->B before it sees B->D; in this
time window it can traverse A->B->C and reach an invalid se.
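For context, the reader side follows the set-then-traverse pattern of
update_cfs_rq_h_load() sketched below (heavily simplified; the helpers
used exist in kernel/sched/fair.c, but the body is not the exact
mainline code):

static void update_cfs_rq_h_load_sketch(struct cfs_rq *cfs_rq)
{
        struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

        /* Walk up: remember the child se at each level ("set list"). */
        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                WRITE_ONCE(cfs_rq->h_load_next, se);
        }

        /* Walk down: follow the remembered pointers ("traverse"). */
        while ((se = READ_ONCE(cfs_rq->h_load_next))) {
                /* ... propagate h_load ...; a freed se here is the crash. */
                cfs_rq = group_cfs_rq(se);
        }
}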
We can avoid such stale pointer accesses by clearing ->h_load_next when
unregistering the cgroup.
Suggested-by: Vincent Guittot <vincent.guittot(a)linaro.org>
Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()")
Cc: <stable(a)vger.kernel.org>
Co-developed-by: Cruz Zhao <CruzZhao(a)linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao(a)linux.alibaba.com>
Signed-off-by: Peng Wang <peng_wang(a)linux.alibaba.com>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cee1793e8277..a5fce15093d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13427,6 +13427,14 @@ void unregister_fair_sched_group(struct task_group *tg)
list_del_leaf_cfs_rq(cfs_rq);
}
remove_entity_load_avg(se);
+ /*
+ * Clear parent's h_load_next if it points to the
+ * sched_entity being freed to avoid stale pointer.
+ */
+ struct cfs_rq *parent_cfs_rq = cfs_rq_of(se);
+
+ if (READ_ONCE(parent_cfs_rq->h_load_next) == se)
+ WRITE_ONCE(parent_cfs_rq->h_load_next, NULL);
}
/*
--
2.27.0
On LoongArch systems, the guest PMU hardware is shared between guest and
host, while the PMU interrupt is separate. The PMU is passed through to
the VM, and there is a PMU context switch when exiting to the host and
returning to the guest.
There is an optimization to check whether the PMU is enabled by the
guest. If not, there is no need to restore the guest PMU context before
returning to the guest. However, if it is enabled, the guest PMU context
needs to be switched back in. Currently the KVM_REQ_PMU notification is
set on vCPU context switch, but it is missed if there is no vCPU context
switch while the PMU is in use by the guest VM.
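For reference, KVM_REQ_PMU follows the usual KVM request/consume
pattern; a rough sketch of the consumer side is shown below (the
function and restore-helper names are assumptions for illustration,
not the actual LoongArch symbols):

static void handle_pending_requests_sketch(struct kvm_vcpu *vcpu)
{
        /* Consumed on the way back into the guest. */
        if (kvm_check_request(KVM_REQ_PMU, vcpu)) {
                /* Re-own the PMU and reload the guest PMU CSRs. */
                kvm_restore_guest_pmu_sketch(vcpu);     /* assumed helper */
        }
}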
Fixes: f4e40ea9f78f ("LoongArch: KVM: Add PMU support for guest")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Bibo Mao <maobibo(a)loongson.cn>
---
arch/loongarch/kvm/vcpu.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/loongarch/kvm/vcpu.c b/arch/loongarch/kvm/vcpu.c
index 30e3b089a596..bf56ad29ac15 100644
--- a/arch/loongarch/kvm/vcpu.c
+++ b/arch/loongarch/kvm/vcpu.c
@@ -132,6 +132,9 @@ static void kvm_lose_pmu(struct kvm_vcpu *vcpu)
* Clear KVM_LARCH_PMU if the guest is not using PMU CSRs when
* exiting the guest, so that the next time trap into the guest.
* We don't need to deal with PMU CSRs contexts.
+ *
+ * Otherwise set request bit KVM_REQ_PMU to restore guest PMU
+ * before entering the guest VM.
*/
val = kvm_read_sw_gcsr(csr, LOONGARCH_CSR_PERFCTRL0);
val |= kvm_read_sw_gcsr(csr, LOONGARCH_CSR_PERFCTRL1);
@@ -139,6 +142,8 @@ static void kvm_lose_pmu(struct kvm_vcpu *vcpu)
val |= kvm_read_sw_gcsr(csr, LOONGARCH_CSR_PERFCTRL3);
if (!(val & KVM_PMU_EVENT_ENABLED))
vcpu->arch.aux_inuse &= ~KVM_LARCH_PMU;
+ else
+ kvm_make_request(KVM_REQ_PMU, vcpu);
kvm_restore_host_pmu(vcpu);
}
--
2.39.3
A few cleanups and a bugfix that are either suitable to apply after the
swap table phase I or were found during code review.
Patch 1 is a bugfix and needs to be included in the stable branch; the
rest have no behavior change.
---
Kairui Song (4):
mm, swap: do not perform synchronous discard during allocation
mm, swap: rename helper for setup bad slots
mm, swap: cleanup swap entry allocation parameter
mm/migrate, swap: drop usage of folio_index
include/linux/swap.h | 4 ++--
mm/migrate.c | 4 ++--
mm/shmem.c | 2 +-
mm/swap.h | 21 -----------------
mm/swapfile.c | 64 ++++++++++++++++++++++++++++++++++++----------------
mm/vmscan.c | 4 ++--
6 files changed, 52 insertions(+), 47 deletions(-)
---
base-commit: 53e573001f2b5168f9b65d2b79e9563a3b479c17
change-id: 20251007-swap-clean-after-swap-table-p1-b9a7635ee3fa
Best regards,
--
Kairui Song <kasong(a)tencent.com>
When waiting for the PCIe link to come up, both link up and link down
are valid results depending on the device state.
Since the link may come up later, and to get rid of the mis-reported PM
errors shown below, do not return -ETIMEDOUT; the outcome has already
been reported in dw_pcie_wait_for_link().
PM error logs introduced by the -ETIMEDOUT error return:
imx6q-pcie 33800000.pcie: Phy link never came up
imx6q-pcie 33800000.pcie: PM: dpm_run_callback(): genpd_resume_noirq returns -110
imx6q-pcie 33800000.pcie: PM: failed to resume noirq: error -110
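For reference, dw_pcie_wait_for_link() already reports a link that never
comes up on its own, roughly along the lines of the sketch below
(simplified; retry constants and message wording are approximate, not
the exact mainline code), so dropping the error return loses no
information:

int dw_pcie_wait_for_link_sketch(struct dw_pcie *pci)
{
        int retries;

        /* Poll the link for a bounded time. */
        for (retries = 0; retries < LINK_WAIT_MAX_RETRIES; retries++) {
                if (dw_pcie_link_up(pci))
                        return 0;
                msleep(90);     /* approximate poll interval */
        }

        dev_info(pci->dev, "Phy link never came up\n");
        return -ETIMEDOUT;
}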
Cc: stable(a)vger.kernel.org
Fixes: 4774faf854f5 ("PCI: dwc: Implement generic suspend/resume functionality")
Signed-off-by: Richard Zhu <hongxing.zhu(a)nxp.com>
Reviewed-by: Frank Li <Frank.Li(a)nxp.com>
---
drivers/pci/controller/dwc/pcie-designware-host.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
index a4d9838bc33f0..8430ac433d457 100644
--- a/drivers/pci/controller/dwc/pcie-designware-host.c
+++ b/drivers/pci/controller/dwc/pcie-designware-host.c
@@ -1212,10 +1212,9 @@ int dw_pcie_resume_noirq(struct dw_pcie *pci)
if (ret)
return ret;
- ret = dw_pcie_wait_for_link(pci);
- if (ret)
- return ret;
+ /* Ignore errors, the link may come up later */
+ dw_pcie_wait_for_link(pci);
- return ret;
+ return 0;
}
EXPORT_SYMBOL_GPL(dw_pcie_resume_noirq);
--
2.37.1
A chip freeze is observed on i.MX7D when the PCIe RC sends the
PME_Turn_Off message and no devices are connected on the port.
To work around this kind of issue, skip sending the PME_Turn_Off message
if there is no endpoint connected.
Cc: stable(a)vger.kernel.org
Fixes: 4774faf854f5 ("PCI: dwc: Implement generic suspend/resume functionality")
Fixes: a528d1a72597 ("PCI: imx6: Use DWC common suspend resume method")
Signed-off-by: Richard Zhu <hongxing.zhu(a)nxp.com>
Reviewed-by: Frank Li <Frank.Li(a)nxp.com>
---
drivers/pci/controller/dwc/pcie-designware-host.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/controller/dwc/pcie-designware-host.c b/drivers/pci/controller/dwc/pcie-designware-host.c
index 09b50a5ce19bb..a4d9838bc33f0 100644
--- a/drivers/pci/controller/dwc/pcie-designware-host.c
+++ b/drivers/pci/controller/dwc/pcie-designware-host.c
@@ -1136,12 +1136,15 @@ int dw_pcie_suspend_noirq(struct dw_pcie *pci)
if (dw_pcie_readw_dbi(pci, offset + PCI_EXP_LNKCTL) & PCI_EXP_LNKCTL_ASPM_L1)
return 0;
- if (pci->pp.ops->pme_turn_off) {
- pci->pp.ops->pme_turn_off(&pci->pp);
- } else {
- ret = dw_pcie_pme_turn_off(pci);
- if (ret)
- return ret;
+ /* Skip PME_Turn_Off message if there is no endpoint connected */
+ if (dw_pcie_get_ltssm(pci) > DW_PCIE_LTSSM_DETECT_WAIT) {
+ if (pci->pp.ops->pme_turn_off) {
+ pci->pp.ops->pme_turn_off(&pci->pp);
+ } else {
+ ret = dw_pcie_pme_turn_off(pci);
+ if (ret)
+ return ret;
+ }
}
if (dwc_quirk(pci, QUIRK_NOL2POLL_IN_PM)) {
--
2.37.1