The process_madvise() system call returns an error even after processing
some of the VMAs passed in the 'struct iovec' vector list, which leaves
the user unsure where to restart the advice. This also contradicts the
syscall's man page[1], which states that the "return value may be less
than the total number of requested bytes, if an error occurred after some
iovec elements were already processed.".
Consider a user who passes 10 VMAs in the 'struct iovec' vector list, of
which 9 are processed and one is not. The call then returns only the error
caused by the failed VMA, despite the first 9 VMAs having been processed,
leaving the user unable to tell which VMA failed. Returning the number of
bytes processed lets the user determine which VMA failed and then
retry or skip the advice on that VMA.
[1] https://man7.org/linux/man-pages/man2/process_madvise.2.html
Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
Cc: <stable(a)vger.kernel.org> # 5.10+
Signed-off-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
---
Changes in V2:
-- Separated the ENOMEM handling from returning the bytes processed, as per Minchan's comments.
-- This patch contains only the correction of the bytes-processed return value of process_madvise().
Changes in V1:
-- Fixed the ENOMEM handling and return bytes processed by process_madvise.
-- https://patchwork.kernel.org/project/linux-mm/patch/1646803679-11433-1-git-…
mm/madvise.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 38d0f51..e97e6a9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1433,8 +1433,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
iov_iter_advance(&iter, iovec.iov_len);
}
- if (ret == 0)
- ret = total_len - iov_iter_count(&iter);
+ ret = (total_len - iov_iter_count(&iter)) ? : ret;
release_mm:
mmput(mm);
--
2.7.4
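For readers following the process_madvise() patch above, here is a minimal
user-space sketch of how a caller might use the byte count to locate the
element to retry or skip. The patched line uses the GNU C `a ?: b` shorthand,
which yields `a` when it is non-zero and `b` otherwise, so a non-zero byte
count takes precedence over the error code. The helper name
first_unprocessed_iov() is hypothetical and not part of the patch; the sketch
assumes a kernel with this fix applied, where only fully processed iovec
elements are counted in the return value.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not part of the patch): given the iovec array that
 * was passed to process_madvise() and the non-negative byte count it
 * returned, return the index of the first element that was not processed.
 * Returns iovcnt when every element was covered by the byte count.
 */
static size_t first_unprocessed_iov(const struct iovec *iov, size_t iovcnt,
				    ssize_t processed)
{
	size_t done;
	size_t i;

	assert(processed >= 0);
	done = (size_t)processed;

	for (i = 0; i < iovcnt; i++) {
		/* Stop at the first element the byte count does not cover. */
		if (done < iov[i].iov_len)
			break;
		done -= iov[i].iov_len;
	}
	return i;
}
```

A caller could then reissue process_madvise() starting at the returned index
plus one to skip the failing range, or at the returned index to retry it.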
This is the start of the stable review cycle for the 5.4.185 release.
There are 43 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 16 Mar 2022 11:27:22 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.185-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.4.185-rc1
Krish Sadhukhan <krish.sadhukhan(a)oracle.com>
KVM: SVM: Don't flush cache if hardware enforces cache coherency across encryption domains
Krish Sadhukhan <krish.sadhukhan(a)oracle.com>
x86/mm/pat: Don't flush cache if hardware enforces cache coherency across encryption domnains
Krish Sadhukhan <krish.sadhukhan(a)oracle.com>
x86/cpu: Add hardware-enforced cache coherency as a CPUID feature
Borislav Petkov <bp(a)suse.de>
x86/cpufeatures: Mark two free bits in word 3
Josh Triplett <josh(a)joshtriplett.org>
ext4: add check to prevent attempting to resize an fs with sparse_super2
Russell King (Oracle) <rmk+kernel(a)armlinux.org.uk>
ARM: fix Thumb2 regression with Spectre BHB
Michael S. Tsirkin <mst(a)redhat.com>
virtio: acknowledge all features before access
Michael S. Tsirkin <mst(a)redhat.com>
virtio: unexport virtio_finalize_features
Pali Rohár <pali(a)kernel.org>
arm64: dts: marvell: armada-37xx: Remap IO space to bus address 0x0
Emil Renner Berthing <kernel(a)esmil.dk>
riscv: Fix auipc+jalr relocation range checks
Rong Chen <rong.chen(a)amlogic.com>
mmc: meson: Fix usage of meson_mmc_post_req()
Robert Hancock <robert.hancock(a)calian.com>
net: macb: Fix lost RX packet wakeup race in NAPI receive
Dan Carpenter <dan.carpenter(a)oracle.com>
staging: gdm724x: fix use after free in gdm_lte_rx()
Miklos Szeredi <mszeredi(a)redhat.com>
fuse: fix pipe buffer lifetime for direct_io
Randy Dunlap <rdunlap(a)infradead.org>
ARM: Spectre-BHB: provide empty stub for non-config
Mike Kravetz <mike.kravetz(a)oracle.com>
selftests/memfd: clean up mapping in mfd_fail_write
Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
selftest/vm: fix map_fixed_noreplace test failure
Sven Schnelle <svens(a)linux.ibm.com>
tracing: Ensure trace buffer is at least 4096 bytes large
Niels Dossche <dossche.niels(a)gmail.com>
ipv6: prevent a possible race condition with lifetimes
Marek Marczykowski-Górecki <marmarek(a)invisiblethingslab.com>
Revert "xen-netback: Check for hotplug-status existence before watching"
Marek Marczykowski-Górecki <marmarek(a)invisiblethingslab.com>
Revert "xen-netback: remove 'hotplug-status' once it has served its purpose"
suresh kumar <suresh2514(a)gmail.com>
net-sysfs: add check for netdevice being present to speed_show
Kumar Kartikeya Dwivedi <memxor(a)gmail.com>
selftests/bpf: Add test for bpf_timer overwriting crash
Jeremy Linton <jeremy.linton(a)arm.com>
net: bcmgenet: Don't claim WOL when its not available
Eric Dumazet <edumazet(a)google.com>
sctp: fix kernel-infoleak for SCTP sockets
Clément Léger <clement.leger(a)bootlin.com>
net: phy: DP83822: clear MISR2 register to disable interrupts
Miaoqian Lin <linmq006(a)gmail.com>
gianfar: ethtool: Fix refcount leak in gfar_get_ts_info
Mark Featherston <mark(a)embeddedTS.com>
gpio: ts4900: Do not set DAT and OE together
Guillaume Nault <gnault(a)redhat.com>
selftests: pmtu.sh: Kill tcpdump processes launched by subshell.
Pavel Skripkin <paskripkin(a)gmail.com>
NFC: port100: fix use-after-free in port100_send_complete
Moshe Shemesh <moshe(a)nvidia.com>
net/mlx5: Fix a race on command flush flow
Mohammad Kabat <mohammadkab(a)nvidia.com>
net/mlx5: Fix size field in bufferx_reg struct
Duoming Zhou <duoming(a)zju.edu.cn>
ax25: Fix NULL pointer dereference in ax25_kill_by_device
Jiasheng Jiang <jiasheng(a)iscas.ac.cn>
net: ethernet: lpc_eth: Handle error for clk_enable
Jiasheng Jiang <jiasheng(a)iscas.ac.cn>
net: ethernet: ti: cpts: Handle error for clk_enable
Miaoqian Lin <linmq006(a)gmail.com>
ethernet: Fix error handling in xemaclite_of_probe
Joel Stanley <joel(a)jms.id.au>
ARM: dts: aspeed: Fix AST2600 quad spi group
Jernej Skrabec <jernej.skrabec(a)gmail.com>
drm/sun4i: mixer: Fix P010 and P210 format numbers
Tom Rix <trix(a)redhat.com>
qed: return status of qed_iov_get_link
Jia-Ju Bai <baijiaju1990(a)gmail.com>
net: qlogic: check the return value of dma_alloc_coherent() in qed_vf_hw_prepare()
Xie Yongji <xieyongji(a)bytedance.com>
virtio-blk: Don't use MAX_DISCARD_SEGMENTS if max_discard_seg is zero
Pali Rohár <pali(a)kernel.org>
arm64: dts: armada-3720-turris-mox: Add missing ethernet0 alias
Taniya Das <tdas(a)codeaurora.org>
clk: qcom: gdsc: Add support to update GDSC transition delay
-------------
Diffstat:
Makefile | 4 +-
arch/arm/boot/dts/aspeed-g6-pinctrl.dtsi | 2 +-
arch/arm/include/asm/spectre.h | 6 +++
arch/arm/kernel/entry-armv.S | 4 +-
.../boot/dts/marvell/armada-3720-turris-mox.dts | 8 +++-
arch/arm64/boot/dts/marvell/armada-37xx.dtsi | 2 +-
arch/riscv/kernel/module.c | 21 +++++++--
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/kernel/cpu/scattered.c | 1 +
arch/x86/kvm/svm.c | 3 +-
arch/x86/mm/pageattr.c | 2 +-
drivers/block/virtio_blk.c | 10 +++-
drivers/clk/qcom/gdsc.c | 26 +++++++++--
drivers/clk/qcom/gdsc.h | 8 +++-
drivers/gpio/gpio-ts4900.c | 24 ++++++++--
drivers/gpu/drm/sun4i/sun8i_mixer.h | 8 ++--
drivers/mmc/host/meson-gx-mmc.c | 15 +++---
drivers/net/ethernet/broadcom/genet/bcmgenet_wol.c | 7 +++
drivers/net/ethernet/cadence/macb_main.c | 25 +++++++++-
drivers/net/ethernet/freescale/gianfar_ethtool.c | 1 +
drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 15 +++---
drivers/net/ethernet/nxp/lpc_eth.c | 5 +-
drivers/net/ethernet/qlogic/qed/qed_sriov.c | 18 +++++---
drivers/net/ethernet/qlogic/qed/qed_vf.c | 7 +++
drivers/net/ethernet/ti/cpts.c | 4 +-
drivers/net/ethernet/xilinx/xilinx_emaclite.c | 4 +-
drivers/net/phy/dp83822.c | 2 +-
drivers/net/xen-netback/xenbus.c | 13 ++----
drivers/nfc/port100.c | 2 +
drivers/staging/gdm724x/gdm_lte.c | 5 +-
drivers/virtio/virtio.c | 39 ++++++++--------
fs/ext4/resize.c | 5 ++
fs/fuse/dev.c | 12 ++++-
fs/fuse/file.c | 1 +
fs/fuse/fuse_i.h | 1 +
include/linux/mlx5/mlx5_ifc.h | 4 +-
include/linux/virtio.h | 1 -
include/linux/virtio_config.h | 3 +-
kernel/trace/trace.c | 10 ++--
net/ax25/af_ax25.c | 7 +++
net/core/net-sysfs.c | 2 +-
net/ipv6/addrconf.c | 2 +
net/sctp/diag.c | 9 ++--
.../testing/selftests/bpf/prog_tests/timer_crash.c | 32 +++++++++++++
tools/testing/selftests/bpf/progs/timer_crash.c | 54 ++++++++++++++++++++++
tools/testing/selftests/memfd/memfd_test.c | 1 +
tools/testing/selftests/net/pmtu.sh | 7 ++-
tools/testing/selftests/vm/map_fixed_noreplace.c | 49 +++++++++++++++-----
48 files changed, 378 insertions(+), 115 deletions(-)
On Fri, Mar 18, 2022 at 02:42:49PM +0000, Geliang Tang wrote:
> Hi Greg,
>
> I got this bpf selftests build break today on the stable branch 5.10.106:
>
> =========================================================================
> CLNG-LLC [test_maps] test_tracepoint.o
> progs/timer_crash.c:8:19: error: field has incomplete type 'struct bpf_timer'
> struct bpf_timer timer;
> ^
> progs/timer_crash.c:8:9: note: forward declaration of 'struct bpf_timer'
> struct bpf_timer timer;
> ^
> progs/timer_crash.c:35:6: warning: implicit declaration of function 'bpf_get_current_task_btf' is invalid in C99 [-Wimplicit-function-declaration]
> if (bpf_get_current_task_btf()->tgid != pid)
> ^
> progs/timer_crash.c:35:34: error: member reference type 'int' is not a pointer
> if (bpf_get_current_task_btf()->tgid != pid)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
> progs/timer_crash.c:49:3: warning: implicit declaration of function 'bpf_timer_cancel' is invalid in C99 [-Wimplicit-function-declaration]
> bpf_timer_cancel(&e->timer);
> ^
> 2 warnings and 2 errors generated.
> CLNG-LLC [test_maps] test_trace_ext_tracing.o
> llc: error: llc: <stdin>:1:1: error: expected top-level entity
> BPF obj compilation failed
> ^
> make: *** [Makefile:402: tools/testing/selftests/bpf/timer_crash.o] Error 1
> make: *** Waiting for unfinished jobs....
> CLNG-LLC [test_maps] test_trace_ext.o
> =========================================================================
>
> It was introduced by the commit "selftests/bpf: Add test for bpf_timer
> overwriting crash", because the commit "bpf: Introduce bpf timers." has not
> been merged into the stable branch yet.
>
> I am writing to you to report this bug.
>
Now reverted, thanks!
greg k-h
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ebe48d368e97d007bfeb76fcb065d6cfc4c96645 Mon Sep 17 00:00:00 2001
From: Steffen Klassert <steffen.klassert(a)secunet.com>
Date: Mon, 7 Mar 2022 13:11:39 +0100
Subject: [PATCH] esp: Fix possible buffer overflow in ESP transformation
The maximum message size that can be sent is bigger than
the maximum size that skb_page_frag_refill can allocate.
So it is possible to write beyond the allocated buffer.
Fix this by doing a fallback to COW in that case.
v2:
Avoid get_order() costs as suggested by Linus Torvalds.
Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible")
Reported-by: valis <sec(a)valis.email>
Signed-off-by: Steffen Klassert <steffen.klassert(a)secunet.com>
diff --git a/include/net/esp.h b/include/net/esp.h
index 9c5637d41d95..90cd02ff77ef 100644
--- a/include/net/esp.h
+++ b/include/net/esp.h
@@ -4,6 +4,8 @@
#include <linux/skbuff.h>
+#define ESP_SKB_FRAG_MAXSIZE (PAGE_SIZE << SKB_FRAG_PAGE_ORDER)
+
struct ip_esp_hdr;
static inline struct ip_esp_hdr *ip_esp_hdr(const struct sk_buff *skb)
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index e1b1d080e908..70e6c87fbe3d 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -446,6 +446,7 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
struct page *page;
struct sk_buff *trailer;
int tailen = esp->tailen;
+ unsigned int allocsz;
/* this is non-NULL only with TCP/UDP Encapsulation */
if (x->encap) {
@@ -455,6 +456,10 @@ int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
return err;
}
+ allocsz = ALIGN(skb->data_len + tailen, L1_CACHE_BYTES);
+ if (allocsz > ESP_SKB_FRAG_MAXSIZE)
+ goto cow;
+
if (!skb_cloned(skb)) {
if (tailen <= skb_tailroom(skb)) {
nfrags = 1;
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 7591160edce1..b0ffbcd5432d 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -482,6 +482,7 @@ int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
struct page *page;
struct sk_buff *trailer;
int tailen = esp->tailen;
+ unsigned int allocsz;
if (x->encap) {
int err = esp6_output_encap(x, skb, esp);
@@ -490,6 +491,10 @@ int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
return err;
}
+ allocsz = ALIGN(skb->data_len + tailen, L1_CACHE_BYTES);
+ if (allocsz > ESP_SKB_FRAG_MAXSIZE)
+ goto cow;
+
if (!skb_cloned(skb)) {
if (tailen <= skb_tailroom(skb)) {
nfrags = 1;
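The size check added by the ESP patch above can be sketched in user space as
follows. This is an illustration only: the SKETCH_* constants are assumed
values (4 KiB pages, fragment page order 3, 64-byte cache lines) standing in
for the kernel's PAGE_SIZE, SKB_FRAG_PAGE_ORDER and L1_CACHE_BYTES, and
sketch_align() and esp_needs_cow() are hypothetical names, not kernel
functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration; the kernel derives these per build. */
#define SKETCH_PAGE_SIZE	4096u
#define SKETCH_FRAG_PAGE_ORDER	3u
#define SKETCH_L1_CACHE_BYTES	64u
#define SKETCH_FRAG_MAXSIZE	(SKETCH_PAGE_SIZE << SKETCH_FRAG_PAGE_ORDER)

/* Round x up to a multiple of a, where a is a power of two. */
static unsigned int sketch_align(unsigned int x, unsigned int a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Mirrors the added check: when the cache-line-aligned allocation size
 * exceeds the largest page fragment, fall back to the copy-on-write path.
 */
static bool esp_needs_cow(unsigned int data_len, unsigned int tailen)
{
	unsigned int allocsz = sketch_align(data_len + tailen,
					    SKETCH_L1_CACHE_BYTES);

	return allocsz > SKETCH_FRAG_MAXSIZE;
}
```

Under these assumed constants the largest fragment is 32768 bytes, so a
payload whose aligned size is exactly 32768 stays on the fast path and
anything larger takes the fallback.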
Hi,
I would like to request the following patches to be included
into the stable 5.10 tree:
a049a30fc27c ("net: usb: Correct PHY handling of smsc95xx")
0bf3885324a8 ("net: usb: Correct reset handling of smsc95xx")
c70c453abcbf ("smsc95xx: Ignore -ENODEV errors when device is unplugged")
They are already present in 5.15 and 5.16 and they fix real issues
on 5.10 too. I have been running 5.10 with these 3 patches applied locally
and no reboot/disconnect errors are seen anymore. Alexander Stein also
sees an smsc95xx suspend/resume issue fixed in 5.10 with the series applied.
Thanks,
Fabio Estevam
From: Filipe Manana <fdmanana(a)suse.com>
Commit 40cdc509877bacb438213b83c7541c5e24a1d9ec upstream
After the recent changes made by commit c2e39305299f01 ("btrfs: clear
extent buffer uptodate when we fail to write it") and its followup fix,
commit 651740a5024117 ("btrfs: check WRITE_ERR when trying to read an
extent buffer"), we can now end up not cleaning up space reservations of
log tree extent buffers after a transaction abort happens, as well as not
cleaning up still dirty extent buffers.
This happens because if writeback for a log tree extent buffer failed,
then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
when trying to free the log tree with free_log_tree(), which iterates
over the tree, we can end up getting an -EIO error when trying to read
a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
immediately as we can not iterate over the entire tree.
In that case we never update the reserved space for an extent buffer in
the respective block group and space_info object.
When this happens we get the following traces when unmounting the fs:
[174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
[174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
[174957.399379] ------------[ cut here ]------------
[174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.407523] Modules linked in: btrfs overlay dm_zero (...)
[174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
[174957.429717] Code: 21 48 8b bd (...)
[174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
[174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
[174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
[174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
[174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
[174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
[174957.439317] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.440402] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.443948] Call Trace:
[174957.444264] <TASK>
[174957.444538] btrfs_free_block_groups+0x255/0x3c0 [btrfs]
[174957.445238] close_ctree+0x301/0x357 [btrfs]
[174957.445803] ? call_rcu+0x16c/0x290
[174957.446250] generic_shutdown_super+0x74/0x120
[174957.446832] kill_anon_super+0x14/0x30
[174957.447305] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.447890] deactivate_locked_super+0x31/0xa0
[174957.448440] cleanup_mnt+0x147/0x1c0
[174957.448888] task_work_run+0x5c/0xa0
[174957.449336] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.449934] syscall_exit_to_user_mode+0x16/0x40
[174957.450512] do_syscall_64+0x48/0xc0
[174957.450980] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.451605] RIP: 0033:0x7f328fdc4a97
[174957.452059] Code: 03 0c 00 f7 (...)
[174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.460193] </TASK>
[174957.460534] irq event stamp: 0
[174957.461003] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.463147] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.466323] ---[ end trace bc7ee0c490bce3af ]---
[174957.467282] ------------[ cut here ]------------
[174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.470066] Modules linked in: btrfs overlay dm_zero (...)
[174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G W 5.16.0-rc5-btrfs-next-109 #1
[174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
[174957.488050] Code: 00 00 00 ad de (...)
[174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
[174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
[174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
[174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
[174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
[174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
[174957.499438] FS: 00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
[174957.500990] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
[174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[174957.506167] Call Trace:
[174957.506654] <TASK>
[174957.507047] close_ctree+0x301/0x357 [btrfs]
[174957.507867] ? call_rcu+0x16c/0x290
[174957.508567] generic_shutdown_super+0x74/0x120
[174957.509447] kill_anon_super+0x14/0x30
[174957.510194] btrfs_kill_super+0x12/0x20 [btrfs]
[174957.511123] deactivate_locked_super+0x31/0xa0
[174957.511976] cleanup_mnt+0x147/0x1c0
[174957.512610] task_work_run+0x5c/0xa0
[174957.513309] exit_to_user_mode_prepare+0x1e5/0x1f0
[174957.514231] syscall_exit_to_user_mode+0x16/0x40
[174957.515069] do_syscall_64+0x48/0xc0
[174957.515718] entry_SYSCALL_64_after_hwframe+0x44/0xae
[174957.516688] RIP: 0033:0x7f328fdc4a97
[174957.517413] Code: 03 0c 00 f7 d8 (...)
[174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
[174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
[174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
[174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
[174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
[174957.529404] </TASK>
[174957.529843] irq event stamp: 0
[174957.530256] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.532075] softirqs last enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
[174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
[174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
[174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
[174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
[174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
[174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
[174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
[174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
[174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
This also means that in case we have log tree extent buffers that are
still dirty, we can end up not cleaning them up in case we find an
extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
we have no way of iterating over the rest of the tree.
This issue is very often triggered with test cases generic/475 and
generic/648 from fstests.
The issue could almost be fixed by iterating over the io tree attached to
each log root which keeps track of the ranges of allocated extent buffers,
log_root->dirty_log_pages, however that does not work and has some
inconveniences:
1) After we sync the log, we clear the range of the extent buffers from
the io tree, so we can't find them after writeback. We could keep the
ranges in the io tree, with a separate bit to signal they represent
extent buffers already written, but that means we need to hold on to
more memory until the transaction commits.
How much more memory is used depends a lot on whether we are able to
allocate contiguous extent buffers on disk (and how often) for a log
tree - if we are able to, then a single extent state record can
represent multiple extent buffers, otherwise we need multiple extent
state record structures to track each extent buffer.
In fact, my earlier approach did that:
https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bb…
However that can cause a very significant negative impact on
performance, not only due to the extra memory usage but also because
we get a larger and deeper dirty_log_pages io tree.
We got a report that, on beefy machines at least, we can get such
performance drop with fsmark for example:
https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9…
2) We would be doing it only to deal with an unexpected and exceptional
case, which is basically failure to read an extent buffer from disk
due to IO failures. On a healthy system we don't expect transaction
aborts to happen after all;
3) Instead of relying on iterating the log tree or tracking the ranges
of extent buffers in the dirty_log_pages io tree, using the radix
tree that tracks extent buffers (fs_info->buffer_radix) to find all
log tree extent buffers is not reliable either, because after writeback
of an extent buffer it can be evicted from memory by the release page
callback of the btree inode (btree_releasepage()).
Since there's no way to properly clean up a log tree without
being able to read its extent buffers from disk and without using more
memory to track the logical ranges of the allocated extent buffers, do
the following:
1) When we fail to clean up a log tree, set a flag that indicates that
failure;
2) Trigger writeback of all log tree extent buffers that are still dirty,
and wait for the writeback to complete. This is just to clean up their
state, page states, page leaks, etc;
3) When unmounting the fs, ignore if the number of bytes reserved in a
block group and in a space_info is not 0 if, and only if, we failed to
clean up a log tree. Also ignore only for metadata block groups and the
metadata space_info object.
This is far from a perfect solution, but it serves to silence test
failures such as those from generic/475 and generic/648. However, having
a non-zero value for the reserved bytes counters on unmount after a
transaction abort is not such a terrible thing and is completely
harmless: it does not affect the filesystem integrity in any way.
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Anand Jain <anand.jain(a)oracle.com>
---
Unrelated conflict fix in
fs/btrfs/ctree.h
fs/btrfs/block-group.c | 26 ++++++++++++++++++++++++--
fs/btrfs/ctree.h | 7 +++++++
fs/btrfs/tree-log.c | 23 +++++++++++++++++++++++
3 files changed, 54 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5edd07e0232d..e1c5c2114edf 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -123,7 +123,16 @@ void btrfs_put_block_group(struct btrfs_block_group *cache)
{
if (refcount_dec_and_test(&cache->refs)) {
WARN_ON(cache->pinned > 0);
- WARN_ON(cache->reserved > 0);
+ /*
+ * If there was a failure to cleanup a log tree, very likely due
+ * to an IO failure on a writeback attempt of one or more of its
+ * extent buffers, we could not do proper (and cheap) unaccounting
+ * of their reserved space, so don't warn on reserved > 0 in that
+ * case.
+ */
+ if (!(cache->flags & BTRFS_BLOCK_GROUP_METADATA) ||
+ !BTRFS_FS_LOG_CLEANUP_ERROR(cache->fs_info))
+ WARN_ON(cache->reserved > 0);
/*
* A block_group shouldn't be on the discard_list anymore.
@@ -3888,9 +3897,22 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
* important and indicates a real bug if this happens.
*/
if (WARN_ON(space_info->bytes_pinned > 0 ||
- space_info->bytes_reserved > 0 ||
space_info->bytes_may_use > 0))
btrfs_dump_space_info(info, space_info, 0, 0);
+
+ /*
+ * If there was a failure to cleanup a log tree, very likely due
+ * to an IO failure on a writeback attempt of one or more of its
+ * extent buffers, we could not do proper (and cheap) unaccounting
+ * of their reserved space, so don't warn on bytes_reserved > 0 in
+ * that case.
+ */
+ if (!(space_info->flags & BTRFS_BLOCK_GROUP_METADATA) ||
+ !BTRFS_FS_LOG_CLEANUP_ERROR(info)) {
+ if (WARN_ON(space_info->bytes_reserved > 0))
+ btrfs_dump_space_info(info, space_info, 0, 0);
+ }
+
WARN_ON(space_info->reclaim_size > 0);
list_del(&space_info->list);
btrfs_sysfs_remove_space_info(space_info);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e89f814cc8f5..21c44846b002 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -142,6 +142,9 @@ enum {
BTRFS_FS_STATE_DEV_REPLACING,
/* The btrfs_fs_info created for self-tests */
BTRFS_FS_STATE_DUMMY_FS_INFO,
+
+ /* Indicates there was an error cleaning up a log tree. */
+ BTRFS_FS_STATE_LOG_CLEANUP_ERROR,
};
#define BTRFS_BACKREF_REV_MAX 256
@@ -3578,6 +3581,10 @@ do { \
(errno), fmt, ##args); \
} while (0)
+#define BTRFS_FS_LOG_CLEANUP_ERROR(fs_info) \
+ (unlikely(test_bit(BTRFS_FS_STATE_LOG_CLEANUP_ERROR, \
+ &(fs_info)->fs_state)))
+
__printf(5, 6)
__cold
void __btrfs_panic(struct btrfs_fs_info *fs_info, const char *function,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8ef65073ce8c..e90d80a8a9e3 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3423,6 +3423,29 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
if (log->node) {
ret = walk_log_tree(trans, log, &wc);
if (ret) {
+ /*
+ * We weren't able to traverse the entire log tree, the
+ * typical scenario is getting an -EIO when reading an
+ * extent buffer of the tree, due to a previous writeback
+ * failure of it.
+ */
+ set_bit(BTRFS_FS_STATE_LOG_CLEANUP_ERROR,
+ &log->fs_info->fs_state);
+
+ /*
+ * Some extent buffers of the log tree may still be dirty
+ * and not yet written back to storage, because we may
+ * have updates to a log tree without syncing a log tree,
+ * such as during rename and link operations. So flush
+ * them out and wait for their writeback to complete, so
+ * that we properly cleanup their state and pages.
+ */
+ btrfs_write_marked_extents(log->fs_info,
+ &log->dirty_log_pages,
+ EXTENT_DIRTY | EXTENT_NEW);
+ btrfs_wait_tree_log_extents(log,
+ EXTENT_DIRTY | EXTENT_NEW);
+
if (trans)
btrfs_abort_transaction(trans, ret);
else
--
2.33.1
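The warning condition introduced by the btrfs patch above can be summarized
as a small predicate. This is a stand-alone sketch, not kernel code:
should_warn_reserved() is a hypothetical name, SKETCH_BG_METADATA is an
assumed stand-in for BTRFS_BLOCK_GROUP_METADATA, and the boolean parameter
stands in for the BTRFS_FS_LOG_CLEANUP_ERROR() test on fs_info.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-in for BTRFS_BLOCK_GROUP_METADATA. */
#define SKETCH_BG_METADATA	(1ULL << 2)

/*
 * Returns true when a leftover reservation should still trigger a warning.
 * The warning is suppressed only for metadata block groups (or the metadata
 * space_info) after a log tree cleanup failure has been flagged.
 */
static bool should_warn_reserved(uint64_t flags, bool log_cleanup_error,
				 uint64_t reserved)
{
	if ((flags & SKETCH_BG_METADATA) && log_cleanup_error)
		return false;
	return reserved > 0;
}
```

The early return is the De Morgan equivalent of the patch's
`!(flags & METADATA) || !LOG_CLEANUP_ERROR` guard, so data block groups and
filesystems without a flagged cleanup failure keep warning exactly as before.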
Hi,
I just noticed that the stable repository has the linux-5.17.y tag and
no branch with the linux-5.17.y name. That tag looks like a copy of
Linus' v5.17.
I guess this is a mistake. On my side git refused to push the
linux-5.17.y branch because it already had a tag with the same name.
Could you please remove it?
Sebastian