Dear stable kernel team
I tested this on behalf of darrick.wong(a)oracle.com
I do not have push rights to your repo. I tried. Can you include it
for 5.4 and 5.5 ?
Comment from Darrick
This patch should make it so
that uswsusp can unlock the swap device prior to writing the memory
image to disk, and then lock it again when resuming.
Best regards and Thanks
Marian Klein
Hello,
We ran automated tests on a recent commit from this kernel tree:
Kernel repo: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Commit: bfe3046ecafd - Linux 5.4.23
The results of these automated tests are provided below.
Overall result: FAILED (see details below)
Merge: OK
Compile: OK
Tests: FAILED
All kernel binaries, config files, and logs are available for download here:
https://cki-artifacts.s3.us-east-2.amazonaws.com/index.html?prefix=dataware…
One or more kernel tests failed:
ppc64le:
❌ selinux-policy: serge-testsuite
aarch64:
❌ selinux-policy: serge-testsuite
x86_64:
❌ selinux-policy: serge-testsuite
❌ Networking route_func - local
We hope that these logs can help you find the problem quickly. For the full
detail on our testing procedures, please scroll to the bottom of this message.
Please reply to this email if you have any questions about the tests that we
ran or if you have any suggestions on how to make future tests more effective.
,-. ,-.
( C ) ( K ) Continuous
`-',-.`-' Kernel
( I ) Integration
`-'
______________________________________________________________________________
Compile testing
---------------
We compiled the kernel for 3 architectures:
aarch64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
ppc64le:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
x86_64:
make options: -j30 INSTALL_MOD_STRIP=1 targz-pkg
Hardware testing
----------------
We booted each kernel and ran the following tests:
aarch64:
Host 1:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
⚡⚡⚡ Podman system integration test - as root
⚡⚡⚡ Podman system integration test - as user
⚡⚡⚡ LTP
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ AMTU (Abstract Machine Test Utility)
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking MACsec: sanity
⚡⚡⚡ Networking socket: fuzz
⚡⚡⚡ Networking sctp-auth: sockopts test
⚡⚡⚡ Networking: igmp conformance test
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - transport
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ audit: audit testsuite test
⚡⚡⚡ httpd: mod_ssl smoke sanity
⚡⚡⚡ tuned: tune-processes-through-perf
⚡⚡⚡ ALSA PCM loopback test
⚡⚡⚡ ALSA Control (mixer) Userspace Element test
⚡⚡⚡ storage: SCSI VPD
⚡⚡⚡ trace: ftrace/tracer
🚧 ⚡⚡⚡ CIFS Connectathon
🚧 ⚡⚡⚡ POSIX pjd-fstest suites
🚧 ⚡⚡⚡ jvm - DaCapo Benchmark Suite
🚧 ⚡⚡⚡ jvm - jcstress tests
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ LTP: openposix test suite
🚧 ⚡⚡⚡ Networking vnic: ipvlan/basic
🚧 ⚡⚡⚡ iotop: sanity
🚧 ⚡⚡⚡ Usex - version 1.9-29
🚧 ⚡⚡⚡ storage: dm/common
Host 2:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
✅ xfstests - ext4
⚡⚡⚡ xfstests - xfs
❌ selinux-policy: serge-testsuite
✅ lvm thinp sanity
✅ storage: software RAID testing
🚧 ⚡⚡⚡ Storage blktests
Host 3:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
✅ Podman system integration test - as root
✅ Podman system integration test - as user
✅ LTP
⚡⚡⚡ Loopdev Sanity
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking MACsec: sanity
✅ Networking socket: fuzz
✅ Networking sctp-auth: sockopts test
⚡⚡⚡ Networking: igmp conformance test
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - transport
✅ Networking ipsec: basic netns - tunnel
✅ audit: audit testsuite test
✅ httpd: mod_ssl smoke sanity
✅ tuned: tune-processes-through-perf
⚡⚡⚡ ALSA PCM loopback test
⚡⚡⚡ ALSA Control (mixer) Userspace Element test
✅ storage: SCSI VPD
✅ trace: ftrace/tracer
🚧 ✅ CIFS Connectathon
🚧 ✅ POSIX pjd-fstest suites
🚧 ✅ jvm - DaCapo Benchmark Suite
🚧 ✅ jvm - jcstress tests
🚧 ✅ Memory function: kaslr
🚧 ✅ LTP: openposix test suite
🚧 ✅ Networking vnic: ipvlan/basic
🚧 ✅ iotop: sanity
🚧 ✅ Usex - version 1.9-29
🚧 ✅ storage: dm/common
Host 4:
✅ Boot test
✅ Podman system integration test - as root
⏱ Podman system integration test - as user
⏱ LTP
⏱ Loopdev Sanity
⏱ Memory function: memfd_create
⏱ AMTU (Abstract Machine Test Utility)
⏱ Networking bridge: sanity
⏱ Ethernet drivers sanity
⏱ Networking MACsec: sanity
⏱ Networking socket: fuzz
⏱ Networking sctp-auth: sockopts test
⏱ Networking: igmp conformance test
⏱ Networking route: pmtu
⏱ Networking route_func - local
⏱ Networking route_func - forward
⏱ Networking TCP: keepalive test
⏱ Networking UDP: socket
⏱ Networking tunnel: geneve basic test
⏱ Networking tunnel: gre basic
⏱ L2TP basic test
⏱ Networking tunnel: vxlan basic
⏱ Networking ipsec: basic netns - transport
⏱ Networking ipsec: basic netns - tunnel
⏱ audit: audit testsuite test
⏱ httpd: mod_ssl smoke sanity
⏱ tuned: tune-processes-through-perf
⏱ ALSA PCM loopback test
⏱ ALSA Control (mixer) Userspace Element test
⏱ storage: SCSI VPD
⏱ trace: ftrace/tracer
⏱ CIFS Connectathon
⏱ POSIX pjd-fstest suites
⏱ jvm - DaCapo Benchmark Suite
⏱ jvm - jcstress tests
⏱ Memory function: kaslr
⏱ LTP: openposix test suite
⏱ Networking vnic: ipvlan/basic
⏱ iotop: sanity
⏱ Usex - version 1.9-29
⏱ storage: dm/common
ppc64le:
Host 1:
✅ Boot test
✅ Podman system integration test - as root
✅ Podman system integration test - as user
✅ LTP
✅ Loopdev Sanity
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking MACsec: sanity
✅ Networking socket: fuzz
✅ Networking sctp-auth: sockopts test
✅ Networking route: pmtu
✅ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - tunnel
✅ audit: audit testsuite test
✅ httpd: mod_ssl smoke sanity
✅ tuned: tune-processes-through-perf
✅ ALSA PCM loopback test
✅ ALSA Control (mixer) Userspace Element test
✅ trace: ftrace/tracer
🚧 ✅ CIFS Connectathon
⏱ POSIX pjd-fstest suites
⏱ jvm - DaCapo Benchmark Suite
⏱ jvm - jcstress tests
⏱ Memory function: kaslr
⏱ LTP: openposix test suite
⏱ Networking vnic: ipvlan/basic
⏱ iotop: sanity
⏱ Usex - version 1.9-29
⏱ storage: dm/common
Host 2:
✅ Boot test
✅ xfstests - ext4
✅ xfstests - xfs
❌ selinux-policy: serge-testsuite
✅ lvm thinp sanity
✅ storage: software RAID testing
🚧 ✅ IPMI driver test
🚧 ✅ IPMItool loop stress test
🚧 ✅ Storage blktests
x86_64:
Host 1:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
✅ Boot test
✅ xfstests - ext4
✅ xfstests - xfs
❌ selinux-policy: serge-testsuite
✅ lvm thinp sanity
✅ storage: software RAID testing
⚡⚡⚡ stress: stress-ng
🚧 ✅ IOMMU boot test
🚧 ⚡⚡⚡ IPMI driver test
🚧 ✅ IPMItool loop stress test
🚧 ✅ Storage blktests
Host 2:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ Podman system integration test - as root
⚡⚡⚡ Podman system integration test - as user
⚡⚡⚡ LTP
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ AMTU (Abstract Machine Test Utility)
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking MACsec: sanity
⚡⚡⚡ Networking socket: fuzz
⚡⚡⚡ Networking sctp-auth: sockopts test
⚡⚡⚡ Networking: igmp conformance test
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - transport
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ audit: audit testsuite test
⚡⚡⚡ httpd: mod_ssl smoke sanity
⚡⚡⚡ tuned: tune-processes-through-perf
⚡⚡⚡ pciutils: sanity smoke test
⚡⚡⚡ ALSA PCM loopback test
⚡⚡⚡ ALSA Control (mixer) Userspace Element test
⚡⚡⚡ storage: SCSI VPD
⚡⚡⚡ trace: ftrace/tracer
🚧 ⚡⚡⚡ CIFS Connectathon
🚧 ⚡⚡⚡ POSIX pjd-fstest suites
🚧 ⚡⚡⚡ jvm - DaCapo Benchmark Suite
🚧 ⚡⚡⚡ jvm - jcstress tests
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ LTP: openposix test suite
🚧 ⚡⚡⚡ Networking vnic: ipvlan/basic
🚧 ⚡⚡⚡ iotop: sanity
🚧 ⚡⚡⚡ Usex - version 1.9-29
🚧 ⚡⚡⚡ storage: dm/common
Host 3:
✅ Boot test
✅ Storage SAN device stress - mpt3sas driver
Host 4:
✅ Boot test
✅ Storage SAN device stress - megaraid_sas
Host 5:
⚡ Internal infrastructure issues prevented one or more tests (marked
with ⚡⚡⚡) from running on this architecture.
This is not the fault of the kernel that was tested.
⚡⚡⚡ Boot test
⚡⚡⚡ Podman system integration test - as root
⚡⚡⚡ Podman system integration test - as user
⚡⚡⚡ LTP
⚡⚡⚡ Loopdev Sanity
⚡⚡⚡ Memory function: memfd_create
⚡⚡⚡ AMTU (Abstract Machine Test Utility)
⚡⚡⚡ Networking bridge: sanity
⚡⚡⚡ Ethernet drivers sanity
⚡⚡⚡ Networking MACsec: sanity
⚡⚡⚡ Networking socket: fuzz
⚡⚡⚡ Networking sctp-auth: sockopts test
⚡⚡⚡ Networking: igmp conformance test
⚡⚡⚡ Networking route: pmtu
⚡⚡⚡ Networking route_func - local
⚡⚡⚡ Networking route_func - forward
⚡⚡⚡ Networking TCP: keepalive test
⚡⚡⚡ Networking UDP: socket
⚡⚡⚡ Networking tunnel: geneve basic test
⚡⚡⚡ Networking tunnel: gre basic
⚡⚡⚡ L2TP basic test
⚡⚡⚡ Networking tunnel: vxlan basic
⚡⚡⚡ Networking ipsec: basic netns - transport
⚡⚡⚡ Networking ipsec: basic netns - tunnel
⚡⚡⚡ audit: audit testsuite test
⚡⚡⚡ httpd: mod_ssl smoke sanity
⚡⚡⚡ tuned: tune-processes-through-perf
⚡⚡⚡ pciutils: sanity smoke test
⚡⚡⚡ ALSA PCM loopback test
⚡⚡⚡ ALSA Control (mixer) Userspace Element test
⚡⚡⚡ storage: SCSI VPD
⚡⚡⚡ trace: ftrace/tracer
🚧 ⚡⚡⚡ CIFS Connectathon
🚧 ⚡⚡⚡ POSIX pjd-fstest suites
🚧 ⚡⚡⚡ jvm - DaCapo Benchmark Suite
🚧 ⚡⚡⚡ jvm - jcstress tests
🚧 ⚡⚡⚡ Memory function: kaslr
🚧 ⚡⚡⚡ LTP: openposix test suite
🚧 ⚡⚡⚡ Networking vnic: ipvlan/basic
🚧 ⚡⚡⚡ iotop: sanity
🚧 ⚡⚡⚡ Usex - version 1.9-29
🚧 ⚡⚡⚡ storage: dm/common
Host 6:
✅ Boot test
✅ Podman system integration test - as root
✅ Podman system integration test - as user
✅ LTP
✅ Loopdev Sanity
✅ Memory function: memfd_create
✅ AMTU (Abstract Machine Test Utility)
✅ Networking bridge: sanity
✅ Ethernet drivers sanity
✅ Networking MACsec: sanity
✅ Networking socket: fuzz
✅ Networking sctp-auth: sockopts test
✅ Networking: igmp conformance test
✅ Networking route: pmtu
❌ Networking route_func - local
✅ Networking route_func - forward
✅ Networking TCP: keepalive test
✅ Networking UDP: socket
✅ Networking tunnel: geneve basic test
✅ Networking tunnel: gre basic
✅ L2TP basic test
✅ Networking tunnel: vxlan basic
✅ Networking ipsec: basic netns - transport
✅ Networking ipsec: basic netns - tunnel
✅ audit: audit testsuite test
✅ httpd: mod_ssl smoke sanity
✅ tuned: tune-processes-through-perf
✅ pciutils: sanity smoke test
✅ ALSA PCM loopback test
✅ ALSA Control (mixer) Userspace Element test
✅ storage: SCSI VPD
✅ trace: ftrace/tracer
🚧 ✅ CIFS Connectathon
🚧 ✅ POSIX pjd-fstest suites
🚧 ✅ jvm - DaCapo Benchmark Suite
🚧 ✅ jvm - jcstress tests
🚧 ✅ Memory function: kaslr
🚧 ✅ LTP: openposix test suite
🚧 ✅ Networking vnic: ipvlan/basic
🚧 ✅ iotop: sanity
🚧 ✅ Usex - version 1.9-29
🚧 ✅ storage: dm/common
Test sources: https://github.com/CKI-project/tests-beaker
💚 Pull requests are welcome for new tests or improvements to existing tests!
Waived tests
------------
If the test run included waived tests, they are marked with 🚧. Such tests are
executed but their results are not taken into account. Tests are waived when
their results are not reliable enough, e.g. when they're just introduced or are
being fixed.
Testing timeout
---------------
We aim to provide a report within reasonable timeframe. Tests that haven't
finished running yet are marked with ⏱.
Hi All,
I and various other Fedora users have noticed that Fedora's 5.5.6 build gets stuck
at boot on a Lenovo X1 7th gen, see:
https://bugzilla.redhat.com/show_bug.cgi?id=1772498
This is caused by the addition of this commit to 5.5.6:
24c259557c45 ("ASoC: SOF: Intel: hda: Fix SKL dai count")
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=…
###
ASoC: SOF: Intel: hda: Fix SKL dai count
[ Upstream commit a6947c9d86bcfd61b758b5693eba58defe7fd2ae ]
With fourth pin added for iDisp for skl_dai, update SOF_SKL_DAI_NUM to
account for the change. Without this, dais from the bottom of the list
are skipped. In current state that's the case for 'Alt Analog CPU DAI'.
Fixes: ac42b142cd76 ("ASoC: SOF: Intel: hda: Add iDisp4 DAI")
Signed-off-by: Cezary Rojewski <cezary.rojewski(a)intel.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart(a)linux.intel.com>
Link: https://lore.kernel.org/r/20200113114054.9716-1-cezary.rojewski@intel.com
Signed-off-by: Mark Brown <broonie(a)kernel.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
###
Notice the "Fixes: ac42b142cd76 (...)", that commit-id actually
does not exist, the correct commit-id which this fixes is:
e68d6696575e ("ASoC: SOF: Intel: hda: Add iDisp4 DAI")
and that commit is not in 5.5.6, which is causing the problem,
the missing commit makes an array one larger and the fix for the
missing fix which did end up in 5.5.6 and bumps a define which is
used to walk over the array in some places by one so now the
walking is going over the array boundary.
For the Fedora kernels I've fixed this by adding the
"ASoC: SOF: Intel: hda: Add iDisp4 DAI" commit as a downstream
patch for our kernels. I believe that this is probably the best
fix for 5.5.z too.
Regards,
Hans
p.s.
I know that the stable series are partly based on automatically
picking patches now. I wonder if the scripts doing that could be
made smarter wrt rejecting patches with a Fixes tag where the
fixed patch is not present, so where in essence a pre-requisite
of the patch being added is missing ?
Hi Greg, Sasha,
Could you please help to backport the upstream commit
20faba848752901de23a4d45a1174d64d2069dde ("irqchip/gic-v3-its: Fix
misuse of GENMASK macro") to the stable linux-4.19 tree since it's
definitely a fix for 4.19. Note that it's not needed for other stable
trees.
I've tried and it can be applied to 4.19.106 cleanly.
Thanks,
Zenghui
Hi,
We don't have these two commits in 5.4-stable:
ff002b30181d30cdfbca316dadd099c3ca0d739c
9392a27d88b9707145d713654eb26f0c29789e50
because they don't apply with the rework that was done in how io_uring
handles offload. Since there's no io-wq in 5.4, it doesn't make sense to
do two patches. I'm attaching my port of the two for 5.4-stable, it's
been tested. Please queue it up for the next 5.4-stable, thanks!
--
Jens Axboe
This is a note to let you know that I've just added the patch titled
vt: selection, push sel_lock up
to my tty git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git
in the tty-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From e8c75a30a23c6ba63f4ef6895cbf41fd42f21aa2 Mon Sep 17 00:00:00 2001
From: Jiri Slaby <jslaby(a)suse.cz>
Date: Fri, 28 Feb 2020 12:54:06 +0100
Subject: vt: selection, push sel_lock up
sel_lock cannot nest in the console lock. Thanks to syzkaller, the
kernel states firmly:
> WARNING: possible circular locking dependency detected
> 5.6.0-rc3-syzkaller #0 Not tainted
> ------------------------------------------------------
> syz-executor.4/20336 is trying to acquire lock:
> ffff8880a2e952a0 (&tty->termios_rwsem){++++}, at: tty_unthrottle+0x22/0x100 drivers/tty/tty_ioctl.c:136
>
> but task is already holding lock:
> ffffffff89462e70 (sel_lock){+.+.}, at: paste_selection+0x118/0x470 drivers/tty/vt/selection.c:374
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (sel_lock){+.+.}:
> mutex_lock_nested+0x1b/0x30 kernel/locking/mutex.c:1118
> set_selection_kernel+0x3b8/0x18a0 drivers/tty/vt/selection.c:217
> set_selection_user+0x63/0x80 drivers/tty/vt/selection.c:181
> tioclinux+0x103/0x530 drivers/tty/vt/vt.c:3050
> vt_ioctl+0x3f1/0x3a30 drivers/tty/vt/vt_ioctl.c:364
This is ioctl(TIOCL_SETSEL).
Locks held on the path: console_lock -> sel_lock
> -> #1 (console_lock){+.+.}:
> console_lock+0x46/0x70 kernel/printk/printk.c:2289
> con_flush_chars+0x50/0x650 drivers/tty/vt/vt.c:3223
> n_tty_write+0xeae/0x1200 drivers/tty/n_tty.c:2350
> do_tty_write drivers/tty/tty_io.c:962 [inline]
> tty_write+0x5a1/0x950 drivers/tty/tty_io.c:1046
This is write().
Locks held on the path: termios_rwsem -> console_lock
> -> #0 (&tty->termios_rwsem){++++}:
> down_write+0x57/0x140 kernel/locking/rwsem.c:1534
> tty_unthrottle+0x22/0x100 drivers/tty/tty_ioctl.c:136
> mkiss_receive_buf+0x12aa/0x1340 drivers/net/hamradio/mkiss.c:902
> tty_ldisc_receive_buf+0x12f/0x170 drivers/tty/tty_buffer.c:465
> paste_selection+0x346/0x470 drivers/tty/vt/selection.c:389
> tioclinux+0x121/0x530 drivers/tty/vt/vt.c:3055
> vt_ioctl+0x3f1/0x3a30 drivers/tty/vt/vt_ioctl.c:364
This is ioctl(TIOCL_PASTESEL).
Locks held on the path: sel_lock -> termios_rwsem
> other info that might help us debug this:
>
> Chain exists of:
> &tty->termios_rwsem --> console_lock --> sel_lock
Clearly. From the above, we have:
console_lock -> sel_lock
sel_lock -> termios_rwsem
termios_rwsem -> console_lock
Fix this by reversing the console_lock -> sel_lock dependency in
ioctl(TIOCL_SETSEL). First, lock sel_lock, then console_lock.
Signed-off-by: Jiri Slaby <jslaby(a)suse.cz>
Reported-by: syzbot+26183d9746e62da329b8(a)syzkaller.appspotmail.com
Fixes: 07e6124a1a46 ("vt: selection, close sel_buffer race")
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200228115406.5735-2-jslaby@suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/tty/vt/selection.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/drivers/tty/vt/selection.c b/drivers/tty/vt/selection.c
index 9126a01290ea..d7d2e4b844bc 100644
--- a/drivers/tty/vt/selection.c
+++ b/drivers/tty/vt/selection.c
@@ -214,7 +214,6 @@ static int __set_selection_kernel(struct tiocl_selection *v, struct tty_struct *
if (ps > pe) /* make sel_start <= sel_end */
swap(ps, pe);
- mutex_lock(&sel_lock);
if (sel_cons != vc_cons[fg_console].d) {
clear_selection();
sel_cons = vc_cons[fg_console].d;
@@ -260,10 +259,9 @@ static int __set_selection_kernel(struct tiocl_selection *v, struct tty_struct *
break;
case TIOCL_SELPOINTER:
highlight_pointer(pe);
- goto unlock;
+ return 0;
default:
- ret = -EINVAL;
- goto unlock;
+ return -EINVAL;
}
/* remove the pointer */
@@ -285,7 +283,7 @@ static int __set_selection_kernel(struct tiocl_selection *v, struct tty_struct *
else if (new_sel_start == sel_start)
{
if (new_sel_end == sel_end) /* no action required */
- goto unlock;
+ return 0;
else if (new_sel_end > sel_end) /* extend to right */
highlight(sel_end + 2, new_sel_end);
else /* contract from right */
@@ -313,8 +311,7 @@ static int __set_selection_kernel(struct tiocl_selection *v, struct tty_struct *
if (!bp) {
printk(KERN_WARNING "selection: kmalloc() failed\n");
clear_selection();
- ret = -ENOMEM;
- goto unlock;
+ return -ENOMEM;
}
kfree(sel_buffer);
sel_buffer = bp;
@@ -339,8 +336,7 @@ static int __set_selection_kernel(struct tiocl_selection *v, struct tty_struct *
}
}
sel_buffer_lth = bp - sel_buffer;
-unlock:
- mutex_unlock(&sel_lock);
+
return ret;
}
@@ -348,9 +344,11 @@ int set_selection_kernel(struct tiocl_selection *v, struct tty_struct *tty)
{
int ret;
+ mutex_lock(&sel_lock);
console_lock();
ret = __set_selection_kernel(v, tty);
console_unlock();
+ mutex_unlock(&sel_lock);
return ret;
}
--
2.25.1
This is a note to let you know that I've just added the patch titled
vt: selection, push console lock down
to my tty git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git
in the tty-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
>From 4b70dd57a15d2f4685ac6e38056bad93e81e982f Mon Sep 17 00:00:00 2001
From: Jiri Slaby <jslaby(a)suse.cz>
Date: Fri, 28 Feb 2020 12:54:05 +0100
Subject: vt: selection, push console lock down
We need to nest the console lock in sel_lock, so we have to push it down
a bit. Fortunately, the callers of set_selection_* just lock the console
lock around the function call. So moving it down is easy.
In the next patch, we switch the order.
Signed-off-by: Jiri Slaby <jslaby(a)suse.cz>
Fixes: 07e6124a1a46 ("vt: selection, close sel_buffer race")
Cc: stable <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20200228115406.5735-1-jslaby@suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/staging/speakup/selection.c | 2 --
drivers/tty/vt/selection.c | 13 ++++++++++++-
drivers/tty/vt/vt.c | 2 --
3 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/staging/speakup/selection.c b/drivers/staging/speakup/selection.c
index a8b4d0c5ab7e..032f3264fba1 100644
--- a/drivers/staging/speakup/selection.c
+++ b/drivers/staging/speakup/selection.c
@@ -51,9 +51,7 @@ static void __speakup_set_selection(struct work_struct *work)
goto unref;
}
- console_lock();
set_selection_kernel(&sel, tty);
- console_unlock();
unref:
tty_kref_put(tty);
diff --git a/drivers/tty/vt/selection.c b/drivers/tty/vt/selection.c
index 0c50d7410b31..9126a01290ea 100644
--- a/drivers/tty/vt/selection.c
+++ b/drivers/tty/vt/selection.c
@@ -181,7 +181,7 @@ int set_selection_user(const struct tiocl_selection __user *sel,
return set_selection_kernel(&v, tty);
}
-int set_selection_kernel(struct tiocl_selection *v, struct tty_struct *tty)
+static int __set_selection_kernel(struct tiocl_selection *v, struct tty_struct *tty)
{
struct vc_data *vc = vc_cons[fg_console].d;
int new_sel_start, new_sel_end, spc;
@@ -343,6 +343,17 @@ int set_selection_kernel(struct tiocl_selection *v, struct tty_struct *tty)
mutex_unlock(&sel_lock);
return ret;
}
+
+int set_selection_kernel(struct tiocl_selection *v, struct tty_struct *tty)
+{
+ int ret;
+
+ console_lock();
+ ret = __set_selection_kernel(v, tty);
+ console_unlock();
+
+ return ret;
+}
EXPORT_SYMBOL_GPL(set_selection_kernel);
/* Insert the contents of the selection buffer into the
diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
index 0cfbb7182b5a..15d27698054a 100644
--- a/drivers/tty/vt/vt.c
+++ b/drivers/tty/vt/vt.c
@@ -3046,10 +3046,8 @@ int tioclinux(struct tty_struct *tty, unsigned long arg)
switch (type)
{
case TIOCL_SETSEL:
- console_lock();
ret = set_selection_user((struct tiocl_selection
__user *)(p+1), tty);
- console_unlock();
break;
case TIOCL_PASTESEL:
ret = paste_selection(tty);
--
2.25.1
From: Andrei Botila <andrei.botila(a)nxp.com>
Since in the software implementation of XTS-AES there is
no notion of sector every input length is processed the same way.
CAAM implementation has the notion of sector which causes different
results between the software implementation and the one in CAAM
for input lengths bigger than 512 bytes.
Increase sector size to maximum value on 16 bits.
Fixes: c6415a6016bf ("crypto: caam - add support for acipher xts(aes)")
Cc: <stable(a)vger.kernel.org> # v4.12+
Signed-off-by: Andrei Botila <andrei.botila(a)nxp.com>
---
drivers/crypto/caam/caamalg_desc.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/drivers/crypto/caam/caamalg_desc.c b/drivers/crypto/caam/caamalg_desc.c
index aa9ccca67045..8ebbbd28b1f7 100644
--- a/drivers/crypto/caam/caamalg_desc.c
+++ b/drivers/crypto/caam/caamalg_desc.c
@@ -1518,7 +1518,13 @@ EXPORT_SYMBOL(cnstr_shdsc_skcipher_decap);
*/
void cnstr_shdsc_xts_skcipher_encap(u32 * const desc, struct alginfo *cdata)
{
- __be64 sector_size = cpu_to_be64(512);
+ /*
+ * Set sector size to a big value, practically disabling
+ * sector size segmentation in xts implementation. We cannot
+ * take full advantage of this HW feature with existing
+ * crypto API / dm-crypt SW architecture.
+ */
+ __be64 sector_size = cpu_to_be64(BIT(15));
u32 *key_jump_cmd;
init_sh_desc(desc, HDR_SHARE_SERIAL | HDR_SAVECTX);
@@ -1571,7 +1577,13 @@ EXPORT_SYMBOL(cnstr_shdsc_xts_skcipher_encap);
*/
void cnstr_shdsc_xts_skcipher_decap(u32 * const desc, struct alginfo *cdata)
{
- __be64 sector_size = cpu_to_be64(512);
+ /*
+ * Set sector size to a big value, practically disabling
+ * sector size segmentation in xts implementation. We cannot
+ * take full advantage of this HW feature with existing
+ * crypto API / dm-crypt SW architecture.
+ */
+ __be64 sector_size = cpu_to_be64(BIT(15));
u32 *key_jump_cmd;
init_sh_desc(desc, HDR_SHARE_SERIAL | HDR_SAVECTX);
--
2.17.1
4.19.99 seems to have picked up quite a few device link patches.
However, one of them [1] broke the ability to add stateful and
stateless links between the same two devices. Looks like the breakage
was fixed in the mainline kernel in subsequent commits [2] and [3].
Can we pull [2] and [3] into 4.19 please? And any other intermediate
device link patches up to [3].
[1] - 6fdc440366f1a99f344b629ac92f350aefd77911
[2] - 515db266a9da
[3] - fb583c8eeeb1 (this fixed a bug in [2])
Thanks,
Saravana
From: Jann Horn <jannh(a)google.com>
[ Upstream commit 6ab405114b0b229151ef06f4e31c7834dd09d0c0 ]
Check whether inputs from userspace are too long (explicit length field too
big or string not null-terminated) to avoid out-of-bounds reads.
As far as I can tell, this can at worst lead to very limited kernel heap
memory disclosure or oopses.
This bug can be triggered by an unprivileged user even if the xt_bpf module
is not loaded: iptables is available in network namespaces, and the xt_bpf
module can be autoloaded.
Triggering the bug with a classic BPF filter with fake length 0x1000 causes
the following KASAN report:
==================================================================
BUG: KASAN: slab-out-of-bounds in bpf_prog_create+0x84/0xf0
Read of size 32768 at addr ffff8801eff2c494 by task test/4627
CPU: 0 PID: 4627 Comm: test Not tainted 4.15.0-rc1+ #1
[...]
Call Trace:
dump_stack+0x5c/0x85
print_address_description+0x6a/0x260
kasan_report+0x254/0x370
? bpf_prog_create+0x84/0xf0
memcpy+0x1f/0x50
bpf_prog_create+0x84/0xf0
bpf_mt_check+0x90/0xd6 [xt_bpf]
[...]
Allocated by task 4627:
kasan_kmalloc+0xa0/0xd0
__kmalloc_node+0x47/0x60
xt_alloc_table_info+0x41/0x70 [x_tables]
[...]
The buggy address belongs to the object at ffff8801eff2c3c0
which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 212 bytes inside of
2048-byte region [ffff8801eff2c3c0, ffff8801eff2cbc0)
[...]
==================================================================
Fixes: e6f30c731718 ("netfilter: x_tables: add xt_bpf match")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
Signed-off-by: Zubin Mithra <zsm(a)chromium.org>
---
* Syzkaller triggered an OOB read with the following stacktrace:
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0xcb/0x130 lib/dump_stack.c:56
print_trailer+0x151/0x15a mm/slub.c:658
object_err+0x35/0x3d mm/slub.c:665
print_address_description mm/kasan/report.c:188 [inline]
kasan_report_error mm/kasan/report.c:285 [inline]
kasan_report.part.0.cold+0x21a/0x4c1 mm/kasan/report.c:310
kasan_report+0x25/0x30 mm/kasan/report.c:297
check_memory_region_inline mm/kasan/kasan.c:292 [inline]
check_memory_region+0x116/0x190 mm/kasan/kasan.c:299
memcpy+0x24/0x50 mm/kasan/kasan.c:334
bpf_prog_create+0xdd/0x230 net/core/filter.c:1075
bpf_mt_check+0xb1/0x100 net/netfilter/xt_bpf.c:31
xt_check_match+0x26b/0x560 net/netfilter/x_tables.c:444
check_match net/ipv4/netfilter/ip_tables.c:593 [inline]
find_check_match net/ipv4/netfilter/ip_tables.c:616 [inline]
find_check_entry.isra.0+0x2fe/0x950 net/ipv4/netfilter/ip_tables.c:673
translate_table+0xb3f/0x15b0 net/ipv4/netfilter/ip_tables.c:878
do_replace net/ipv4/netfilter/ip_tables.c:1300 [inline]
do_ipt_set_ctl+0x2cf/0x460 net/ipv4/netfilter/ip_tables.c:1864
nf_sockopt net/netfilter/nf_sockopt.c:105 [inline]
nf_setsockopt+0x72/0xd0 net/netfilter/nf_sockopt.c:114
ip_setsockopt net/ipv4/ip_sockglue.c:1227 [inline]
ip_setsockopt+0xa0/0xb0 net/ipv4/ip_sockglue.c:1212
udp_setsockopt+0x53/0x90 net/ipv4/udp.c:2162
sock_common_setsockopt+0x9f/0xe0 net/core/sock.c:2697
SYSC_setsockopt net/socket.c:1769 [inline]
SyS_setsockopt+0x146/0x210 net/socket.c:1748
entry_SYSCALL_64_fastpath+0x1e/0xa0
* This commit is present in linux-4.14.y.
* This patch resolves conflicts that arise due to the following commit
not being present in linux-4.9.y, linux-4.4.y:
- 2c16d6033264 ("netfilter: xt_bpf: support ebpf")
* Tests run: syzkaller reproducer
net/netfilter/xt_bpf.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/netfilter/xt_bpf.c b/net/netfilter/xt_bpf.c
index dffee9d47ec4b..7b993f25aab92 100644
--- a/net/netfilter/xt_bpf.c
+++ b/net/netfilter/xt_bpf.c
@@ -25,6 +25,9 @@ static int bpf_mt_check(const struct xt_mtchk_param *par)
struct xt_bpf_info *info = par->matchinfo;
struct sock_fprog_kern program;
+ if (info->bpf_program_num_elem > XT_BPF_MAX_NUM_INSTR)
+ return -EINVAL;
+
program.len = info->bpf_program_num_elem;
program.filter = info->bpf_program;
--
2.25.0.265.gbab2e86ba0-goog
From: Martynas Pumputis <martynas(a)weave.works>
[ Upstream commit ed07d9a021df6da53456663a76999189badc432a ]
This patch enables the clash resolution for NAT (disabled in
"590b52e10d41") if clashing conntracks match (i.e. both tuples are equal)
and a protocol allows it.
The clash might happen for a connections-less protocol (e.g. UDP) when
two threads in parallel writes to the same socket and consequent calls
to "get_unique_tuple" return the same tuples (incl. reply tuples).
In this case it is safe to perform the resolution, as the losing CT
describes the same mangling as the winning CT, so no modifications to
the packet are needed, and the result of rules traversal for the loser's
packet stays valid.
Signed-off-by: Martynas Pumputis <martynas(a)weave.works>
Signed-off-by: Pablo Neira Ayuso <pablo(a)netfilter.org>
Signed-off-by: Andy Strohman <astroh(a)amazon.com>
---
net/netfilter/nf_conntrack_core.c | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 5123e91b1982..4ced7c7102b6 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -632,6 +632,18 @@ nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
net_eq(net, nf_ct_net(ct));
}
+static inline bool
+nf_ct_match(const struct nf_conn *ct1, const struct nf_conn *ct2)
+{
+ return nf_ct_tuple_equal(&ct1->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
+ &ct2->tuplehash[IP_CT_DIR_ORIGINAL].tuple) &&
+ nf_ct_tuple_equal(&ct1->tuplehash[IP_CT_DIR_REPLY].tuple,
+ &ct2->tuplehash[IP_CT_DIR_REPLY].tuple) &&
+ nf_ct_zone_equal(ct1, nf_ct_zone(ct2), IP_CT_DIR_ORIGINAL) &&
+ nf_ct_zone_equal(ct1, nf_ct_zone(ct2), IP_CT_DIR_REPLY) &&
+ net_eq(nf_ct_net(ct1), nf_ct_net(ct2));
+}
+
/* caller must hold rcu readlock and none of the nf_conntrack_locks */
static void nf_ct_gc_expired(struct nf_conn *ct)
{
@@ -825,19 +837,21 @@ static int nf_ct_resolve_clash(struct net *net, struct sk_buff *skb,
/* This is the conntrack entry already in hashes that won race. */
struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
const struct nf_conntrack_l4proto *l4proto;
+ enum ip_conntrack_info oldinfo;
+ struct nf_conn *loser_ct = nf_ct_get(skb, &oldinfo);
l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
if (l4proto->allow_clash &&
- ((ct->status & IPS_NAT_DONE_MASK) == 0) &&
!nf_ct_is_dying(ct) &&
atomic_inc_not_zero(&ct->ct_general.use)) {
- enum ip_conntrack_info oldinfo;
- struct nf_conn *loser_ct = nf_ct_get(skb, &oldinfo);
-
- nf_ct_acct_merge(ct, ctinfo, loser_ct);
- nf_conntrack_put(&loser_ct->ct_general);
- nf_ct_set(skb, ct, oldinfo);
- return NF_ACCEPT;
+ if (((ct->status & IPS_NAT_DONE_MASK) == 0) ||
+ nf_ct_match(ct, loser_ct)) {
+ nf_ct_acct_merge(ct, ctinfo, loser_ct);
+ nf_conntrack_put(&loser_ct->ct_general);
+ nf_ct_set(skb, ct, oldinfo);
+ return NF_ACCEPT;
+ }
+ nf_ct_put(ct);
}
NF_CT_STAT_INC(net, drop);
return NF_DROP;
--
2.16.6
Hello Greg,
Can you please consider including the following patch in the stable
linux-5.4.y and linux-5.5.y?
This is to downgrade "EDAC skx: Can't get tolm/tohm" to debug level
on missing PCI device.
854bb48018d5("EDAC: skx_common: downgrade message importance
on missing PCI device")
Thanks,
Qian Lu
The following commit has been merged into the ras/urgent branch of tip:
Commit-ID: 59b5809655bdafb0767d3fd00a3e41711aab07e6
Gitweb: https://git.kernel.org/tip/59b5809655bdafb0767d3fd00a3e41711aab07e6
Author: Tony Luck <tony.luck(a)intel.com>
AuthorDate: Tue, 25 Feb 2020 17:17:37 -08:00
Committer: Borislav Petkov <bp(a)suse.de>
CommitterDate: Thu, 27 Feb 2020 21:36:42 +01:00
x86/mce: Fix logic and comments around MSR_PPIN_CTL
There are two implemented bits in the PPIN_CTL MSR:
Bit 0: LockOut (R/WO)
Set 1 to prevent further writes to MSR_PPIN_CTL.
Bit 1: Enable_PPIN (R/W)
If 1, enables MSR_PPIN to be accessible using RDMSR.
If 0, an attempt to read MSR_PPIN will cause #GP.
So there are four defined values:
0: PPIN is disabled, PPIN_CTL may be updated
1: PPIN is disabled. PPIN_CTL is locked against updates
2: PPIN is enabled. PPIN_CTL may be updated
3: PPIN is enabled. PPIN_CTL is locked against updates
Code would only enable the X86_FEATURE_INTEL_PPIN feature for case "2".
When it should have done so for both case "2" and case "3".
Fix the final test to just check for the enable bit. Also fix some of
the other comments in this function.
Fixes: 3f5a7896a509 ("x86/mce: Include the PPIN in MCE records when available")
Signed-off-by: Tony Luck <tony.luck(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Cc: <stable(a)vger.kernel.org>
Link: https://lkml.kernel.org/r/20200226011737.9958-1-tony.luck@intel.com
---
arch/x86/kernel/cpu/mce/intel.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 5627b10..f996ffb 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -493,17 +493,18 @@ static void intel_ppin_init(struct cpuinfo_x86 *c)
return;
if ((val & 3UL) == 1UL) {
- /* PPIN available but disabled: */
+ /* PPIN locked in disabled mode */
return;
}
- /* If PPIN is disabled, but not locked, try to enable: */
- if (!(val & 3UL)) {
+ /* If PPIN is disabled, try to enable */
+ if (!(val & 2UL)) {
wrmsrl_safe(MSR_PPIN_CTL, val | 2UL);
rdmsrl_safe(MSR_PPIN_CTL, &val);
}
- if ((val & 3UL) == 2UL)
+ /* Is the enable bit set? */
+ if (val & 2UL)
set_cpu_cap(c, X86_FEATURE_INTEL_PPIN);
}
}
The following commit has been merged into the x86/urgent branch of tip:
Commit-ID: 735a6dd02222d8d070c7bb748f25895239ca8c92
Gitweb: https://git.kernel.org/tip/735a6dd02222d8d070c7bb748f25895239ca8c92
Author: Sean Christopherson <sean.j.christopherson(a)intel.com>
AuthorDate: Wed, 26 Feb 2020 15:16:15 -08:00
Committer: Borislav Petkov <bp(a)suse.de>
CommitterDate: Thu, 27 Feb 2020 19:02:45 +01:00
x86/pkeys: Manually set X86_FEATURE_OSPKE to preserve existing changes
Explicitly set X86_FEATURE_OSPKE via set_cpu_cap() instead of calling
get_cpu_cap() to pull the feature bit from CPUID after enabling CR4.PKE.
Invoking get_cpu_cap() effectively wipes out any {set,clear}_cpu_cap()
changes that were made between this_cpu->c_init() and setup_pku(), as
all non-synthetic feature words are reinitialized from the CPU's CPUID
values.
Blasting away capability updates manifests most visibility when running
on a VMX capable CPU, but with VMX disabled by BIOS. To indicate that
VMX is disabled, init_ia32_feat_ctl() clears X86_FEATURE_VMX, using
clear_cpu_cap() instead of setup_clear_cpu_cap() so that KVM can report
which CPU is misconfigured (KVM needs to probe every CPU anyways).
Restoring X86_FEATURE_VMX from CPUID causes KVM to think VMX is enabled,
ultimately leading to an unexpected #GP when KVM attempts to do VMXON.
Arguably, init_ia32_feat_ctl() should use setup_clear_cpu_cap() and let
KVM figure out a different way to report the misconfigured CPU, but VMX
is not the only feature bit that is affected, i.e. there is precedent
that tweaking feature bits via {set,clear}_cpu_cap() after ->c_init()
is expected to work. Most notably, x86_init_rdrand()'s clearing of
X86_FEATURE_RDRAND when RDRAND malfunctions is also overwritten.
Fixes: 0697694564c8 ("x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU")
Reported-by: Jacob Keller <jacob.e.keller(a)intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson(a)intel.com>
Signed-off-by: Borislav Petkov <bp(a)suse.de>
Acked-by: Dave Hansen <dave.hansen(a)linux.intel.com>
Tested-by: Jacob Keller <jacob.e.keller(a)intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lkml.kernel.org/r/20200226231615.13664-1-sean.j.christopherson@inte…
---
arch/x86/kernel/cpu/common.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 52c9bfb..4cdb123 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -445,7 +445,7 @@ static __always_inline void setup_pku(struct cpuinfo_x86 *c)
* cpuid bit to be set. We need to ensure that we
* update that bit in this CPU's "cpu_info".
*/
- get_cpu_cap(c);
+ set_cpu_cap(c, X86_FEATURE_OSPKE);
}
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
The patch below does not apply to the 5.5-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From c6f8c440441029d5621ee5153676243234a4b76e Mon Sep 17 00:00:00 2001
From: Bhawanpreet Lakha <Bhawanpreet.Lakha(a)amd.com>
Date: Fri, 7 Feb 2020 10:41:20 -0500
Subject: [PATCH] drm/amd/display: fix dtm unloading
there was a type in the terminate command.
We should be calling psp_dtm_unload() instead of psp_hdcp_unload()
Fixes: 143f23053333 ("drm/amdgpu: psp DTM init")
Signed-off-by: Bhawanpreet Lakha <Bhawanpreet.Lakha(a)amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu(a)amd.com>
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 3a1570dafe34..146f96661b6b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -1013,6 +1013,30 @@ static int psp_dtm_initialize(struct psp_context *psp)
return 0;
}
+static int psp_dtm_unload(struct psp_context *psp)
+{
+ int ret;
+ struct psp_gfx_cmd_resp *cmd;
+
+ /*
+ * TODO: bypass the unloading in sriov for now
+ */
+ if (amdgpu_sriov_vf(psp->adev))
+ return 0;
+
+ cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
+ if (!cmd)
+ return -ENOMEM;
+
+ psp_prep_ta_unload_cmd_buf(cmd, psp->dtm_context.session_id);
+
+ ret = psp_cmd_submit_buf(psp, NULL, cmd, psp->fence_buf_mc_addr);
+
+ kfree(cmd);
+
+ return ret;
+}
+
int psp_dtm_invoke(struct psp_context *psp, uint32_t ta_cmd_id)
{
/*
@@ -1037,7 +1061,7 @@ static int psp_dtm_terminate(struct psp_context *psp)
if (!psp->dtm_context.dtm_initialized)
return 0;
- ret = psp_hdcp_unload(psp);
+ ret = psp_dtm_unload(psp);
if (ret)
return ret;
Commit 2c361684803e ("PM / Domains: Don't treat zero found compatible idle
states as an error"), moved of_genpd_parse_idle_states() towards allowing
none compatible idle state to be found for the device node, rather than
returning an error code.
However, it didn't consider that the "domain-idle-states" DT property may
be missing as it's optional, which makes of_count_phandle_with_args() to
return -ENOENT. Let's fix this to make the behaviour consistent.
Reported-by: Benjamin Gaignard <benjamin.gaignard(a)st.com>
Fixes: 2c361684803e ("PM / Domains: Don't treat zero found compatible idle states as an error")
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
---
drivers/base/power/domain.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 959d6d5eb000..0a01df608849 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -2653,7 +2653,7 @@ static int genpd_iterate_idle_states(struct device_node *dn,
ret = of_count_phandle_with_args(dn, "domain-idle-states", NULL);
if (ret <= 0)
- return ret;
+ return ret == -ENOENT ? 0 : ret;
/* Loop over the phandles until all the requested entry is found */
of_for_each_phandle(&it, ret, dn, "domain-idle-states", NULL, 0) {
--
2.20.1
[ Upstream commit 7143b5ac5750f404ff3a594b34fdf3fc2f99f828 ]
This patch drops 'cur_mm' before calling cond_resched(), to prevent
the sq_thread from spinning even when the user process is finished.
Before this patch, if the user process ended without closing the
io_uring fd, the sq_thread continues to spin until the
'sq_thread_idle' timeout ends.
In the worst case where the 'sq_thread_idle' parameter is bigger than
INT_MAX, the sq_thread will spin forever.
Fixes: 6c271ce2f1d5 ("io_uring: add submission polling")
Signed-off-by: Stefano Garzarella <sgarzare(a)redhat.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
---
fs/io_uring.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 709671faaed6..33a95e1ffb15 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2740,16 +2740,6 @@ static int io_sq_thread(void *data)
to_submit = io_sqring_entries(ctx);
if (!to_submit) {
- /*
- * We're polling. If we're within the defined idle
- * period, then let us spin without work before going
- * to sleep.
- */
- if (inflight || !time_after(jiffies, timeout)) {
- cond_resched();
- continue;
- }
-
/*
* Drop cur_mm before scheduling, we can't hold it for
* long periods (or over schedule()). Do this before
@@ -2762,6 +2752,16 @@ static int io_sq_thread(void *data)
cur_mm = NULL;
}
+ /*
+ * We're polling. If we're within the defined idle
+ * period, then let us spin without work before going
+ * to sleep.
+ */
+ if (inflight || !time_after(jiffies, timeout)) {
+ cond_resched();
+ continue;
+ }
+
prepare_to_wait(&ctx->sqo_wait, &wait,
TASK_INTERRUPTIBLE);
--
2.24.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 7143b5ac5750f404ff3a594b34fdf3fc2f99f828 Mon Sep 17 00:00:00 2001
From: Stefano Garzarella <sgarzare(a)redhat.com>
Date: Fri, 21 Feb 2020 16:42:16 +0100
Subject: [PATCH] io_uring: prevent sq_thread from spinning when it should stop
This patch drops 'cur_mm' before calling cond_resched(), to prevent
the sq_thread from spinning even when the user process is finished.
Before this patch, if the user process ended without closing the
io_uring fd, the sq_thread continues to spin until the
'sq_thread_idle' timeout ends.
In the worst case where the 'sq_thread_idle' parameter is bigger than
INT_MAX, the sq_thread will spin forever.
Fixes: 6c271ce2f1d5 ("io_uring: add submission polling")
Signed-off-by: Stefano Garzarella <sgarzare(a)redhat.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6e249aa97ba3..b43467b3a8dc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5142,6 +5142,18 @@ static int io_sq_thread(void *data)
* to enter the kernel to reap and flush events.
*/
if (!to_submit || ret == -EBUSY) {
+ /*
+ * Drop cur_mm before scheduling, we can't hold it for
+ * long periods (or over schedule()). Do this before
+ * adding ourselves to the waitqueue, as the unuse/drop
+ * may sleep.
+ */
+ if (cur_mm) {
+ unuse_mm(cur_mm);
+ mmput(cur_mm);
+ cur_mm = NULL;
+ }
+
/*
* We're polling. If we're within the defined idle
* period, then let us spin without work before going
@@ -5156,18 +5168,6 @@ static int io_sq_thread(void *data)
continue;
}
- /*
- * Drop cur_mm before scheduling, we can't hold it for
- * long periods (or over schedule()). Do this before
- * adding ourselves to the waitqueue, as the unuse/drop
- * may sleep.
- */
- if (cur_mm) {
- unuse_mm(cur_mm);
- mmput(cur_mm);
- cur_mm = NULL;
- }
-
prepare_to_wait(&ctx->sqo_wait, &wait,
TASK_INTERRUPTIBLE);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 305e519ce48e935702c32241f07d393c3c8fed3e Mon Sep 17 00:00:00 2001
From: Alexander Potapenko <glider(a)google.com>
Date: Thu, 20 Feb 2020 20:04:30 -0800
Subject: [PATCH] lib/stackdepot.c: fix global out-of-bounds in stack_slabs
Walter Wu has reported a potential case in which init_stack_slab() is
called after stack_slabs[STACK_ALLOC_MAX_SLABS - 1] has already been
initialized. In that case init_stack_slab() will overwrite
stack_slabs[STACK_ALLOC_MAX_SLABS], which may result in a memory
corruption.
Link: http://lkml.kernel.org/r/20200218102950.260263-1-glider@google.com
Fixes: cd11016e5f521 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: Alexander Potapenko <glider(a)google.com>
Reported-by: Walter Wu <walter-zh.wu(a)mediatek.com>
Cc: Dmitry Vyukov <dvyukov(a)google.com>
Cc: Matthias Brugger <matthias.bgg(a)gmail.com>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Kate Stewart <kstewart(a)linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index ed717dd08ff3..81c69c08d1d1 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -83,15 +83,19 @@ static bool init_stack_slab(void **prealloc)
return true;
if (stack_slabs[depot_index] == NULL) {
stack_slabs[depot_index] = *prealloc;
+ *prealloc = NULL;
} else {
- stack_slabs[depot_index + 1] = *prealloc;
+ /* If this is the last depot slab, do not touch the next one. */
+ if (depot_index + 1 < STACK_ALLOC_MAX_SLABS) {
+ stack_slabs[depot_index + 1] = *prealloc;
+ *prealloc = NULL;
+ }
/*
* This smp_store_release pairs with smp_load_acquire() from
* |next_slab_inited| above and in stack_depot_save().
*/
smp_store_release(&next_slab_inited, 1);
}
- *prealloc = NULL;
return true;
}
Upstream: 82c69bf58650 ("drm/i915/gt: Detect if we miss WaIdleLiteRestore")
In order to avoid confusing the HW, we must never submit an empty ring
during lite-restore, that is we should always advance the RING_TAIL
before submitting to stay ahead of the RING_HEAD.
Normally this is prevented by keeping a couple of spare NOPs in the
request->wa_tail so that on resubmission we can advance the tail. This
relies on the request only being resubmitted once, which is the normal
condition as it is seen once for ELSP[1] and then later in ELSP[0]. On
preemption, the requests are unwound and the tail reset back to the
normal end point (as we know the request is incomplete and therefore its
RING_HEAD is even earlier).
However, if this w/a should fail we would try and resubmit the request
with the RING_TAIL already set to the location of this request's wa_tail
potentially causing a GPU hang. We can spot when we do try and
incorrectly resubmit without advancing the RING_TAIL and spare any
embarrassment by forcing the context restore.
In the case of preempt-to-busy, we leave the requests running on the HW
while we unwind. As the ring is still live, we cannot rewind our
rq->tail without forcing a reload so leave it set to rq->wa_tail and
only force a reload if we resubmit after a lite-restore. (Normally, the
forced reload will be a part of the preemption event.)
Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
Closes: https://gitlab.freedesktop.org/drm/intel/issues/673
Signed-off-by: Chris Wilson <chris(a)chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala(a)linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin(a)intel.com>
Cc: stable(a)kernel.vger.org #5.3+
Link: https://patchwork.freedesktop.org/patch/msgid/20191209023215.3519970-1-chri…
---
drivers/gpu/drm/i915/gt/intel_lrc.c | 42 ++++++++++++++---------------
1 file changed, 20 insertions(+), 22 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 4949b5ad860f..e0e4f3deb2da 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -471,12 +471,6 @@ lrc_descriptor(struct intel_context *ce, struct intel_engine_cs *engine)
return desc;
}
-static void unwind_wa_tail(struct i915_request *rq)
-{
- rq->tail = intel_ring_wrap(rq->ring, rq->wa_tail - WA_TAIL_BYTES);
- assert_ring_tail_valid(rq->ring, rq->tail);
-}
-
static struct i915_request *
__unwind_incomplete_requests(struct intel_engine_cs *engine)
{
@@ -495,7 +489,6 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
continue; /* XXX */
__i915_request_unsubmit(rq);
- unwind_wa_tail(rq);
/*
* Push the request back into the queue for later resubmission.
@@ -650,13 +643,29 @@ execlists_schedule_out(struct i915_request *rq)
i915_request_put(rq);
}
-static u64 execlists_update_context(const struct i915_request *rq)
+static u64 execlists_update_context(struct i915_request *rq)
{
struct intel_context *ce = rq->hw_context;
- u64 desc;
+ u64 desc = ce->lrc_desc;
+ u32 tail;
- ce->lrc_reg_state[CTX_RING_TAIL + 1] =
- intel_ring_set_tail(rq->ring, rq->tail);
+ /*
+ * WaIdleLiteRestore:bdw,skl
+ *
+ * We should never submit the context with the same RING_TAIL twice
+ * just in case we submit an empty ring, which confuses the HW.
+ *
+ * We append a couple of NOOPs (gen8_emit_wa_tail) after the end of
+ * the normal request to be able to always advance the RING_TAIL on
+ * subsequent resubmissions (for lite restore). Should that fail us,
+ * and we try and submit the same tail again, force the context
+ * reload.
+ */
+ tail = intel_ring_set_tail(rq->ring, rq->tail);
+ if (unlikely(ce->lrc_reg_state[CTX_RING_TAIL + 1] == tail))
+ desc |= CTX_DESC_FORCE_RESTORE;
+ ce->lrc_reg_state[CTX_RING_TAIL + 1] = tail;
+ rq->tail = rq->wa_tail;
/*
* Make sure the context image is complete before we submit it to HW.
@@ -675,7 +684,6 @@ static u64 execlists_update_context(const struct i915_request *rq)
*/
mb();
- desc = ce->lrc_desc;
ce->lrc_desc &= ~CTX_DESC_FORCE_RESTORE;
return desc;
@@ -1150,16 +1158,6 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
if (!list_is_last(&last->sched.link,
&engine->active.requests))
return;
-
- /*
- * WaIdleLiteRestore:bdw,skl
- * Apply the wa NOOPs to prevent
- * ring:HEAD == rq:TAIL as we resubmit the
- * request. See gen8_emit_fini_breadcrumb() for
- * where we prepare the padding after the
- * end of the request.
- */
- last->tail = last->wa_tail;
}
}
--
2.25.1
Hi,
I was just looking throught the current 5.4-stable queue and saw that
28553fa992cb28 ('Btrfs: fix race between shrinking truncate and fiemap')
is queued. Upstream has a follow-up fix for this: 52e29e331070cd aka
'btrfs: don't set path->leave_spinning for truncate'.
Would be nice to get those in together. I only looked at 5.4, don't
know about other queues.
thanks,
Holger
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a5ae50dea9111db63d30d700766dd5509602f7ad Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 20 Feb 2020 13:29:49 +0000
Subject: [PATCH] Btrfs: fix deadlock during fast fsync when logging prealloc
extents beyond eof
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
CPU 1 CPU 2
btrfs_sync_file()
--> for range [0, 1MiB)
--> inode has a size of
1MiB and has 1 prealloc
extent beyond the
i_size, starting at offset
4MiB
flushes all delalloc for the
range [0MiB, 1MiB) and waits
for the respective ordered
extents to complete
--> before task at CPU 1 locks the
inode, a write into file range
[1MiB, 2MiB + 1KiB) is made
--> i_size is updated to 2MiB + 1KiB
--> writeback is started for that
range, [1MiB, 2MiB + 4KiB)
--> end offset rounded up to
be sector size aligned
btrfs_log_dentry_safe()
btrfs_log_inode_parent()
btrfs_log_inode()
btrfs_log_changed_extents()
btrfs_log_prealloc_extents()
--> does a search on the
inode's root
--> holds a read lock on
leaf X
btrfs_finish_ordered_io()
--> locks range [1MiB, 2MiB + 4KiB)
--> end offset rounded up
to be sector size aligned
--> tries to cow leaf X, through
insert_reserved_file_extent()
--> already locked by the
task at CPU 1
btrfs_truncate_inode_items()
--> gets an i_size of
2MiB + 1KiB, which is
not sector size
aligned
--> tries to lock file
range [2MiB, (u64)-1)
--> the start range
is rounded down
from 2MiB + 1K
to 2MiB to be sector
size aligned
--> but the subrange
[2MiB, 2MiB + 4KiB) is
already locked by
task at CPU 2 which
is waiting to get a
write lock on leaf X
for which we are
holding a read lock
*** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
[ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
[ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
[ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 2779.990466] Call Trace:
[ 2779.990491] ? __schedule+0x384/0xa30
[ 2779.990521] schedule+0x33/0xe0
[ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
[ 2779.990632] ? remove_wait_queue+0x60/0x60
[ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
[ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
[ 2779.990987] ? mark_held_locks+0x6d/0xc0
[ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
[ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
[ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
[ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
[ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
[ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
[ 2779.991432] process_one_work+0x26d/0x6a0
[ 2779.991460] worker_thread+0x4f/0x3e0
[ 2779.991481] ? process_one_work+0x6a0/0x6a0
[ 2779.991489] kthread+0x103/0x140
[ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2779.991515] ret_from_fork+0x3a/0x50
(...)
[ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
[ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.030035] fsstress D 0 17375 17373 0x00004000
[ 2780.030038] Call Trace:
[ 2780.030044] ? __schedule+0x384/0xa30
[ 2780.030052] schedule+0x33/0xe0
[ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
[ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
[ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
[ 2780.030102] ? remove_wait_queue+0x60/0x60
[ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
[ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
[ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
[ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
[ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
[ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
[ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
[ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
[ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
[ 2780.030282] ? dget_parent+0xa1/0x370
[ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
[ 2780.030339] do_fsync+0x38/0x60
[ 2780.030343] __x64_sys_fdatasync+0x13/0x20
[ 2780.030345] do_syscall_64+0x5c/0x280
[ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
[ 2780.030361] Code: Bad RIP value.
[ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
[ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
[ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
[ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
[ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.
Fixes: 28553fa992cb28 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4f47ba652b31..1ccb3f8d528d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4103,8 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
return -ENOMEM;
path->reada = READA_BACK;
- lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
+ if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
/*
* We want to drop from the next block forward in case this new size is
@@ -4368,11 +4369,10 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
if (!ret && last_size > new_size)
last_size = new_size;
btrfs_ordered_update_i_size(inode, last_size, NULL);
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start,
+ (u64)-1, &cached_state);
}
- unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
-
btrfs_free_path(path);
return ret;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a5ae50dea9111db63d30d700766dd5509602f7ad Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 20 Feb 2020 13:29:49 +0000
Subject: [PATCH] Btrfs: fix deadlock during fast fsync when logging prealloc
extents beyond eof
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
CPU 1 CPU 2
btrfs_sync_file()
--> for range [0, 1MiB)
--> inode has a size of
1MiB and has 1 prealloc
extent beyond the
i_size, starting at offset
4MiB
flushes all delalloc for the
range [0MiB, 1MiB) and waits
for the respective ordered
extents to complete
--> before task at CPU 1 locks the
inode, a write into file range
[1MiB, 2MiB + 1KiB) is made
--> i_size is updated to 2MiB + 1KiB
--> writeback is started for that
range, [1MiB, 2MiB + 4KiB)
--> end offset rounded up to
be sector size aligned
btrfs_log_dentry_safe()
btrfs_log_inode_parent()
btrfs_log_inode()
btrfs_log_changed_extents()
btrfs_log_prealloc_extents()
--> does a search on the
inode's root
--> holds a read lock on
leaf X
btrfs_finish_ordered_io()
--> locks range [1MiB, 2MiB + 4KiB)
--> end offset rounded up
to be sector size aligned
--> tries to cow leaf X, through
insert_reserved_file_extent()
--> already locked by the
task at CPU 1
btrfs_truncate_inode_items()
--> gets an i_size of
2MiB + 1KiB, which is
not sector size
aligned
--> tries to lock file
range [2MiB, (u64)-1)
--> the start range
is rounded down
from 2MiB + 1K
to 2MiB to be sector
size aligned
--> but the subrange
[2MiB, 2MiB + 4KiB) is
already locked by
task at CPU 2 which
is waiting to get a
write lock on leaf X
for which we are
holding a read lock
*** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
[ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
[ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
[ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 2779.990466] Call Trace:
[ 2779.990491] ? __schedule+0x384/0xa30
[ 2779.990521] schedule+0x33/0xe0
[ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
[ 2779.990632] ? remove_wait_queue+0x60/0x60
[ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
[ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
[ 2779.990987] ? mark_held_locks+0x6d/0xc0
[ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
[ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
[ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
[ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
[ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
[ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
[ 2779.991432] process_one_work+0x26d/0x6a0
[ 2779.991460] worker_thread+0x4f/0x3e0
[ 2779.991481] ? process_one_work+0x6a0/0x6a0
[ 2779.991489] kthread+0x103/0x140
[ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2779.991515] ret_from_fork+0x3a/0x50
(...)
[ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
[ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.030035] fsstress D 0 17375 17373 0x00004000
[ 2780.030038] Call Trace:
[ 2780.030044] ? __schedule+0x384/0xa30
[ 2780.030052] schedule+0x33/0xe0
[ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
[ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
[ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
[ 2780.030102] ? remove_wait_queue+0x60/0x60
[ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
[ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
[ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
[ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
[ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
[ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
[ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
[ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
[ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
[ 2780.030282] ? dget_parent+0xa1/0x370
[ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
[ 2780.030339] do_fsync+0x38/0x60
[ 2780.030343] __x64_sys_fdatasync+0x13/0x20
[ 2780.030345] do_syscall_64+0x5c/0x280
[ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
[ 2780.030361] Code: Bad RIP value.
[ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
[ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
[ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
[ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
[ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.
Fixes: 28553fa992cb28 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4f47ba652b31..1ccb3f8d528d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4103,8 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
return -ENOMEM;
path->reada = READA_BACK;
- lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
+ if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
/*
* We want to drop from the next block forward in case this new size is
@@ -4368,11 +4369,10 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
if (!ret && last_size > new_size)
last_size = new_size;
btrfs_ordered_update_i_size(inode, last_size, NULL);
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start,
+ (u64)-1, &cached_state);
}
- unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
-
btrfs_free_path(path);
return ret;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a5ae50dea9111db63d30d700766dd5509602f7ad Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 20 Feb 2020 13:29:49 +0000
Subject: [PATCH] Btrfs: fix deadlock during fast fsync when logging prealloc
extents beyond eof
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
CPU 1 CPU 2
btrfs_sync_file()
--> for range [0, 1MiB)
--> inode has a size of
1MiB and has 1 prealloc
extent beyond the
i_size, starting at offset
4MiB
flushes all delalloc for the
range [0MiB, 1MiB) and waits
for the respective ordered
extents to complete
--> before task at CPU 1 locks the
inode, a write into file range
[1MiB, 2MiB + 1KiB) is made
--> i_size is updated to 2MiB + 1KiB
--> writeback is started for that
range, [1MiB, 2MiB + 4KiB)
--> end offset rounded up to
be sector size aligned
btrfs_log_dentry_safe()
btrfs_log_inode_parent()
btrfs_log_inode()
btrfs_log_changed_extents()
btrfs_log_prealloc_extents()
--> does a search on the
inode's root
--> holds a read lock on
leaf X
btrfs_finish_ordered_io()
--> locks range [1MiB, 2MiB + 4KiB)
--> end offset rounded up
to be sector size aligned
--> tries to cow leaf X, through
insert_reserved_file_extent()
--> already locked by the
task at CPU 1
btrfs_truncate_inode_items()
--> gets an i_size of
2MiB + 1KiB, which is
not sector size
aligned
--> tries to lock file
range [2MiB, (u64)-1)
--> the start range
is rounded down
from 2MiB + 1K
to 2MiB to be sector
size aligned
--> but the subrange
[2MiB, 2MiB + 4KiB) is
already locked by
task at CPU 2 which
is waiting to get a
write lock on leaf X
for which we are
holding a read lock
*** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
[ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
[ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
[ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 2779.990466] Call Trace:
[ 2779.990491] ? __schedule+0x384/0xa30
[ 2779.990521] schedule+0x33/0xe0
[ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
[ 2779.990632] ? remove_wait_queue+0x60/0x60
[ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
[ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
[ 2779.990987] ? mark_held_locks+0x6d/0xc0
[ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
[ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
[ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
[ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
[ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
[ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
[ 2779.991432] process_one_work+0x26d/0x6a0
[ 2779.991460] worker_thread+0x4f/0x3e0
[ 2779.991481] ? process_one_work+0x6a0/0x6a0
[ 2779.991489] kthread+0x103/0x140
[ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2779.991515] ret_from_fork+0x3a/0x50
(...)
[ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
[ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.030035] fsstress D 0 17375 17373 0x00004000
[ 2780.030038] Call Trace:
[ 2780.030044] ? __schedule+0x384/0xa30
[ 2780.030052] schedule+0x33/0xe0
[ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
[ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
[ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
[ 2780.030102] ? remove_wait_queue+0x60/0x60
[ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
[ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
[ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
[ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
[ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
[ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
[ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
[ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
[ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
[ 2780.030282] ? dget_parent+0xa1/0x370
[ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
[ 2780.030339] do_fsync+0x38/0x60
[ 2780.030343] __x64_sys_fdatasync+0x13/0x20
[ 2780.030345] do_syscall_64+0x5c/0x280
[ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
[ 2780.030361] Code: Bad RIP value.
[ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
[ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
[ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
[ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
[ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.
Fixes: 28553fa992cb28 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4f47ba652b31..1ccb3f8d528d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4103,8 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
return -ENOMEM;
path->reada = READA_BACK;
- lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
+ if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
/*
* We want to drop from the next block forward in case this new size is
@@ -4368,11 +4369,10 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
if (!ret && last_size > new_size)
last_size = new_size;
btrfs_ordered_update_i_size(inode, last_size, NULL);
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start,
+ (u64)-1, &cached_state);
}
- unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
-
btrfs_free_path(path);
return ret;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From a5ae50dea9111db63d30d700766dd5509602f7ad Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Thu, 20 Feb 2020 13:29:49 +0000
Subject: [PATCH] Btrfs: fix deadlock during fast fsync when logging prealloc
extents beyond eof
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
CPU 1 CPU 2
btrfs_sync_file()
--> for range [0, 1MiB)
--> inode has a size of
1MiB and has 1 prealloc
extent beyond the
i_size, starting at offset
4MiB
flushes all delalloc for the
range [0MiB, 1MiB) and waits
for the respective ordered
extents to complete
--> before task at CPU 1 locks the
inode, a write into file range
[1MiB, 2MiB + 1KiB) is made
--> i_size is updated to 2MiB + 1KiB
--> writeback is started for that
range, [1MiB, 2MiB + 4KiB)
--> end offset rounded up to
be sector size aligned
btrfs_log_dentry_safe()
btrfs_log_inode_parent()
btrfs_log_inode()
btrfs_log_changed_extents()
btrfs_log_prealloc_extents()
--> does a search on the
inode's root
--> holds a read lock on
leaf X
btrfs_finish_ordered_io()
--> locks range [1MiB, 2MiB + 4KiB)
--> end offset rounded up
to be sector size aligned
--> tries to cow leaf X, through
insert_reserved_file_extent()
--> already locked by the
task at CPU 1
btrfs_truncate_inode_items()
--> gets an i_size of
2MiB + 1KiB, which is
not sector size
aligned
--> tries to lock file
range [2MiB, (u64)-1)
--> the start range
is rounded down
from 2MiB + 1K
to 2MiB to be sector
size aligned
--> but the subrange
[2MiB, 2MiB + 4KiB) is
already locked by
task at CPU 2 which
is waiting to get a
write lock on leaf X
for which we are
holding a read lock
*** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
[ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
[ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
[ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 2779.990466] Call Trace:
[ 2779.990491] ? __schedule+0x384/0xa30
[ 2779.990521] schedule+0x33/0xe0
[ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
[ 2779.990632] ? remove_wait_queue+0x60/0x60
[ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
[ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
[ 2779.990987] ? mark_held_locks+0x6d/0xc0
[ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
[ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
[ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
[ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
[ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
[ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
[ 2779.991432] process_one_work+0x26d/0x6a0
[ 2779.991460] worker_thread+0x4f/0x3e0
[ 2779.991481] ? process_one_work+0x6a0/0x6a0
[ 2779.991489] kthread+0x103/0x140
[ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2779.991515] ret_from_fork+0x3a/0x50
(...)
[ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
[ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.030035] fsstress D 0 17375 17373 0x00004000
[ 2780.030038] Call Trace:
[ 2780.030044] ? __schedule+0x384/0xa30
[ 2780.030052] schedule+0x33/0xe0
[ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
[ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
[ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
[ 2780.030102] ? remove_wait_queue+0x60/0x60
[ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
[ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
[ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
[ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
[ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
[ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
[ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
[ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
[ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
[ 2780.030282] ? dget_parent+0xa1/0x370
[ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
[ 2780.030339] do_fsync+0x38/0x60
[ 2780.030343] __x64_sys_fdatasync+0x13/0x20
[ 2780.030345] do_syscall_64+0x5c/0x280
[ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
[ 2780.030361] Code: Bad RIP value.
[ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
[ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
[ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
[ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
[ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.
Fixes: 28553fa992cb28 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4f47ba652b31..1ccb3f8d528d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4103,8 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
return -ENOMEM;
path->reada = READA_BACK;
- lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
+ if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
/*
* We want to drop from the next block forward in case this new size is
@@ -4368,11 +4369,10 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
if (!ret && last_size > new_size)
last_size = new_size;
btrfs_ordered_update_i_size(inode, last_size, NULL);
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start,
+ (u64)-1, &cached_state);
}
- unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
- &cached_state);
-
btrfs_free_path(path);
return ret;
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 52e29e331070cd7d52a64cbf1b0958212a340e28 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Fri, 17 Jan 2020 09:02:20 -0500
Subject: [PATCH] btrfs: don't set path->leave_spinning for truncate
The only time we actually leave the path spinning is if we're truncating
a small amount and don't actually free an extent, which is not a common
occurrence. We have to set the path blocking in order to add the
delayed ref anyway, so the first extent we find we set the path to
blocking and stay blocking for the duration of the operation. With the
upcoming file extent map stuff there will be another case that we have
to have the path blocking, so just swap to blocking always.
Note: this patch also fixes a warning after 28553fa992cb ("Btrfs: fix
race between shrinking truncate and fiemap") got merged that inserts
extent locks around truncation so the path must not leave spinning locks
after btrfs_search_slot.
[70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
[70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
[70.794863] 5 locks held by rsync/1141:
[70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
[70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
[70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
[70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
[70.795362] Call Trace:
[70.795374] dump_stack+0x71/0xa0
[70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6
[70.795459] kmem_cache_alloc+0x1d3/0x290
[70.795471] alloc_extent_state+0x22/0x1c0
[70.795544] __clear_extent_bit+0x3ba/0x580
[70.795557] ? _raw_spin_unlock_irq+0x24/0x30
[70.795569] btrfs_truncate_inode_items+0x339/0xe50
[70.795647] btrfs_evict_inode+0x269/0x540
[70.795659] ? dput.part.38+0x29/0x460
[70.795671] evict+0xcd/0x190
[70.795682] __dentry_kill+0xd6/0x180
[70.795754] dput.part.38+0x2ad/0x460
[70.795765] do_renameat2+0x3cb/0x540
[70.795777] __x64_sys_rename+0x1c/0x20
Reported-by: Dave Jones <davej(a)codemonkey.org.uk>
Fixes: 28553fa992cb ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7d26b4bfb2c6..36deef69f847 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4142,7 +4142,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
goto out;
}
- path->leave_spinning = 1;
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
if (ret < 0)
goto out;
@@ -4294,7 +4293,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
root == fs_info->tree_root)) {
struct btrfs_ref ref = { 0 };
- btrfs_set_path_blocking(path);
bytes_deleted += extent_num_bytes;
btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF,
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 52e29e331070cd7d52a64cbf1b0958212a340e28 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Fri, 17 Jan 2020 09:02:20 -0500
Subject: [PATCH] btrfs: don't set path->leave_spinning for truncate
The only time we actually leave the path spinning is if we're truncating
a small amount and don't actually free an extent, which is not a common
occurrence. We have to set the path blocking in order to add the
delayed ref anyway, so the first extent we find we set the path to
blocking and stay blocking for the duration of the operation. With the
upcoming file extent map stuff there will be another case that we have
to have the path blocking, so just swap to blocking always.
Note: this patch also fixes a warning after 28553fa992cb ("Btrfs: fix
race between shrinking truncate and fiemap") got merged that inserts
extent locks around truncation so the path must not leave spinning locks
after btrfs_search_slot.
[70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
[70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
[70.794863] 5 locks held by rsync/1141:
[70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
[70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
[70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
[70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
[70.795362] Call Trace:
[70.795374] dump_stack+0x71/0xa0
[70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6
[70.795459] kmem_cache_alloc+0x1d3/0x290
[70.795471] alloc_extent_state+0x22/0x1c0
[70.795544] __clear_extent_bit+0x3ba/0x580
[70.795557] ? _raw_spin_unlock_irq+0x24/0x30
[70.795569] btrfs_truncate_inode_items+0x339/0xe50
[70.795647] btrfs_evict_inode+0x269/0x540
[70.795659] ? dput.part.38+0x29/0x460
[70.795671] evict+0xcd/0x190
[70.795682] __dentry_kill+0xd6/0x180
[70.795754] dput.part.38+0x2ad/0x460
[70.795765] do_renameat2+0x3cb/0x540
[70.795777] __x64_sys_rename+0x1c/0x20
Reported-by: Dave Jones <davej(a)codemonkey.org.uk>
Fixes: 28553fa992cb ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7d26b4bfb2c6..36deef69f847 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4142,7 +4142,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
goto out;
}
- path->leave_spinning = 1;
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
if (ret < 0)
goto out;
@@ -4294,7 +4293,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
root == fs_info->tree_root)) {
struct btrfs_ref ref = { 0 };
- btrfs_set_path_blocking(path);
bytes_deleted += extent_num_bytes;
btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF,
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 52e29e331070cd7d52a64cbf1b0958212a340e28 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Fri, 17 Jan 2020 09:02:20 -0500
Subject: [PATCH] btrfs: don't set path->leave_spinning for truncate
The only time we actually leave the path spinning is if we're truncating
a small amount and don't actually free an extent, which is not a common
occurrence. We have to set the path blocking in order to add the
delayed ref anyway, so the first extent we find we set the path to
blocking and stay blocking for the duration of the operation. With the
upcoming file extent map stuff there will be another case that we have
to have the path blocking, so just swap to blocking always.
Note: this patch also fixes a warning after 28553fa992cb ("Btrfs: fix
race between shrinking truncate and fiemap") got merged that inserts
extent locks around truncation so the path must not leave spinning locks
after btrfs_search_slot.
[70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
[70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
[70.794863] 5 locks held by rsync/1141:
[70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
[70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
[70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
[70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
[70.795362] Call Trace:
[70.795374] dump_stack+0x71/0xa0
[70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6
[70.795459] kmem_cache_alloc+0x1d3/0x290
[70.795471] alloc_extent_state+0x22/0x1c0
[70.795544] __clear_extent_bit+0x3ba/0x580
[70.795557] ? _raw_spin_unlock_irq+0x24/0x30
[70.795569] btrfs_truncate_inode_items+0x339/0xe50
[70.795647] btrfs_evict_inode+0x269/0x540
[70.795659] ? dput.part.38+0x29/0x460
[70.795671] evict+0xcd/0x190
[70.795682] __dentry_kill+0xd6/0x180
[70.795754] dput.part.38+0x2ad/0x460
[70.795765] do_renameat2+0x3cb/0x540
[70.795777] __x64_sys_rename+0x1c/0x20
Reported-by: Dave Jones <davej(a)codemonkey.org.uk>
Fixes: 28553fa992cb ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7d26b4bfb2c6..36deef69f847 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4142,7 +4142,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
goto out;
}
- path->leave_spinning = 1;
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
if (ret < 0)
goto out;
@@ -4294,7 +4293,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
root == fs_info->tree_root)) {
struct btrfs_ref ref = { 0 };
- btrfs_set_path_blocking(path);
bytes_deleted += extent_num_bytes;
btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF,
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 52e29e331070cd7d52a64cbf1b0958212a340e28 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Fri, 17 Jan 2020 09:02:20 -0500
Subject: [PATCH] btrfs: don't set path->leave_spinning for truncate
The only time we actually leave the path spinning is if we're truncating
a small amount and don't actually free an extent, which is not a common
occurrence. We have to set the path blocking in order to add the
delayed ref anyway, so the first extent we find we set the path to
blocking and stay blocking for the duration of the operation. With the
upcoming file extent map stuff there will be another case that we have
to have the path blocking, so just swap to blocking always.
Note: this patch also fixes a warning after 28553fa992cb ("Btrfs: fix
race between shrinking truncate and fiemap") got merged that inserts
extent locks around truncation so the path must not leave spinning locks
after btrfs_search_slot.
[70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
[70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
[70.794863] 5 locks held by rsync/1141:
[70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
[70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
[70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
[70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
[70.795362] Call Trace:
[70.795374] dump_stack+0x71/0xa0
[70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6
[70.795459] kmem_cache_alloc+0x1d3/0x290
[70.795471] alloc_extent_state+0x22/0x1c0
[70.795544] __clear_extent_bit+0x3ba/0x580
[70.795557] ? _raw_spin_unlock_irq+0x24/0x30
[70.795569] btrfs_truncate_inode_items+0x339/0xe50
[70.795647] btrfs_evict_inode+0x269/0x540
[70.795659] ? dput.part.38+0x29/0x460
[70.795671] evict+0xcd/0x190
[70.795682] __dentry_kill+0xd6/0x180
[70.795754] dput.part.38+0x2ad/0x460
[70.795765] do_renameat2+0x3cb/0x540
[70.795777] __x64_sys_rename+0x1c/0x20
Reported-by: Dave Jones <davej(a)codemonkey.org.uk>
Fixes: 28553fa992cb ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7d26b4bfb2c6..36deef69f847 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4142,7 +4142,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
goto out;
}
- path->leave_spinning = 1;
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
if (ret < 0)
goto out;
@@ -4294,7 +4293,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
root == fs_info->tree_root)) {
struct btrfs_ref ref = { 0 };
- btrfs_set_path_blocking(path);
bytes_deleted += extent_num_bytes;
btrfs_init_generic_ref(&ref, BTRFS_DROP_DELAYED_REF,
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 28553fa992cb28be6a65566681aac6cafabb4f2d Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 7 Feb 2020 12:23:09 +0000
Subject: [PATCH] Btrfs: fix race between shrinking truncate and fiemap
When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).
Consider the following example:
1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
and a 1MiB prealloc extent at file offset 128KiB;
2) Task A starts doing a shrinking truncate of our inode to reduce it to
a size of 64KiB. Before it searches the subvolume tree for file
extent items to delete, it drops all the extent maps in the range
from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
3) Task B starts doing a fiemap against our inode. When looking up for
the inode's extent maps in the range from 128KiB to (u64)-1, it
doesn't find any in the inode's extent map tree, since they were
removed by task A. Because it didn't find any in the extent map
tree, it scans the inode's subvolume tree for file extent items, and
it finds the 1MiB prealloc extent at file offset 128KiB, then it
creates an extent map based on that file extent item and adds it to
inode's extent map tree (this ends up being done by
btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
get_extent_skip_holes());
4) Task A then drops the prealloc extent at file offset 128KiB and
shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
truncation operation finishes and we end up with an extent map
representing a 1MiB prealloc extent at file offset 128KiB, despite we
don't have any more that extent;
After this the two types of problems we have are:
1) Future calls to fiemap always report that a 1MiB prealloc extent
exists at file offset 128KiB. This is stale information, no longer
correct;
2) If the size of the file is increased, by a truncate operation that
increases the file size or by a write into a file offset > 64KiB for
example, we end up not inserting file extent items to represent holes
for any range between 128KiB and 128KiB + 1MiB, since the hole
expansion function, btrfs_cont_expand() will skip hole insertion for
any range for which an extent map exists that represents a prealloc
extent. This causes fsck to complain about missing file extent items
when not using the NO_HOLES feature.
The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.
Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5b3ec93ff911..7d26b4bfb2c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4085,6 +4085,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
u64 bytes_deleted = 0;
bool be_nice = false;
bool should_throttle = false;
+ const u64 lock_start = ALIGN_DOWN(new_size, fs_info->sectorsize);
+ struct extent_state *cached_state = NULL;
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
@@ -4101,6 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
return -ENOMEM;
path->reada = READA_BACK;
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
+
/*
* We want to drop from the next block forward in case this new size is
* not block aligned since we will be keeping the last block of the
@@ -4367,6 +4372,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
btrfs_ordered_update_i_size(inode, last_size, NULL);
}
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
+
btrfs_free_path(path);
return ret;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 28553fa992cb28be6a65566681aac6cafabb4f2d Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 7 Feb 2020 12:23:09 +0000
Subject: [PATCH] Btrfs: fix race between shrinking truncate and fiemap
When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).
Consider the following example:
1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
and a 1MiB prealloc extent at file offset 128KiB;
2) Task A starts doing a shrinking truncate of our inode to reduce it to
a size of 64KiB. Before it searches the subvolume tree for file
extent items to delete, it drops all the extent maps in the range
from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
3) Task B starts doing a fiemap against our inode. When looking up for
the inode's extent maps in the range from 128KiB to (u64)-1, it
doesn't find any in the inode's extent map tree, since they were
removed by task A. Because it didn't find any in the extent map
tree, it scans the inode's subvolume tree for file extent items, and
it finds the 1MiB prealloc extent at file offset 128KiB, then it
creates an extent map based on that file extent item and adds it to
inode's extent map tree (this ends up being done by
btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
get_extent_skip_holes());
4) Task A then drops the prealloc extent at file offset 128KiB and
shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
truncation operation finishes and we end up with an extent map
representing a 1MiB prealloc extent at file offset 128KiB, despite we
don't have any more that extent;
After this the two types of problems we have are:
1) Future calls to fiemap always report that a 1MiB prealloc extent
exists at file offset 128KiB. This is stale information, no longer
correct;
2) If the size of the file is increased, by a truncate operation that
increases the file size or by a write into a file offset > 64KiB for
example, we end up not inserting file extent items to represent holes
for any range between 128KiB and 128KiB + 1MiB, since the hole
expansion function, btrfs_cont_expand() will skip hole insertion for
any range for which an extent map exists that represents a prealloc
extent. This causes fsck to complain about missing file extent items
when not using the NO_HOLES feature.
The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.
Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5b3ec93ff911..7d26b4bfb2c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4085,6 +4085,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
u64 bytes_deleted = 0;
bool be_nice = false;
bool should_throttle = false;
+ const u64 lock_start = ALIGN_DOWN(new_size, fs_info->sectorsize);
+ struct extent_state *cached_state = NULL;
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
@@ -4101,6 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
return -ENOMEM;
path->reada = READA_BACK;
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
+
/*
* We want to drop from the next block forward in case this new size is
* not block aligned since we will be keeping the last block of the
@@ -4367,6 +4372,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
btrfs_ordered_update_i_size(inode, last_size, NULL);
}
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
+
btrfs_free_path(path);
return ret;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 28553fa992cb28be6a65566681aac6cafabb4f2d Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 7 Feb 2020 12:23:09 +0000
Subject: [PATCH] Btrfs: fix race between shrinking truncate and fiemap
When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).
Consider the following example:
1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
and a 1MiB prealloc extent at file offset 128KiB;
2) Task A starts doing a shrinking truncate of our inode to reduce it to
a size of 64KiB. Before it searches the subvolume tree for file
extent items to delete, it drops all the extent maps in the range
from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
3) Task B starts doing a fiemap against our inode. When looking up for
the inode's extent maps in the range from 128KiB to (u64)-1, it
doesn't find any in the inode's extent map tree, since they were
removed by task A. Because it didn't find any in the extent map
tree, it scans the inode's subvolume tree for file extent items, and
it finds the 1MiB prealloc extent at file offset 128KiB, then it
creates an extent map based on that file extent item and adds it to
inode's extent map tree (this ends up being done by
btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
get_extent_skip_holes());
4) Task A then drops the prealloc extent at file offset 128KiB and
shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
truncation operation finishes and we end up with an extent map
representing a 1MiB prealloc extent at file offset 128KiB, despite we
don't have any more that extent;
After this the two types of problems we have are:
1) Future calls to fiemap always report that a 1MiB prealloc extent
exists at file offset 128KiB. This is stale information, no longer
correct;
2) If the size of the file is increased, by a truncate operation that
increases the file size or by a write into a file offset > 64KiB for
example, we end up not inserting file extent items to represent holes
for any range between 128KiB and 128KiB + 1MiB, since the hole
expansion function, btrfs_cont_expand() will skip hole insertion for
any range for which an extent map exists that represents a prealloc
extent. This causes fsck to complain about missing file extent items
when not using the NO_HOLES feature.
The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.
Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5b3ec93ff911..7d26b4bfb2c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4085,6 +4085,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
u64 bytes_deleted = 0;
bool be_nice = false;
bool should_throttle = false;
+ const u64 lock_start = ALIGN_DOWN(new_size, fs_info->sectorsize);
+ struct extent_state *cached_state = NULL;
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
@@ -4101,6 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
return -ENOMEM;
path->reada = READA_BACK;
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
+
/*
* We want to drop from the next block forward in case this new size is
* not block aligned since we will be keeping the last block of the
@@ -4367,6 +4372,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
btrfs_ordered_update_i_size(inode, last_size, NULL);
}
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
+
btrfs_free_path(path);
return ret;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 28553fa992cb28be6a65566681aac6cafabb4f2d Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana(a)suse.com>
Date: Fri, 7 Feb 2020 12:23:09 +0000
Subject: [PATCH] Btrfs: fix race between shrinking truncate and fiemap
When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).
Consider the following example:
1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
and a 1MiB prealloc extent at file offset 128KiB;
2) Task A starts doing a shrinking truncate of our inode to reduce it to
a size of 64KiB. Before it searches the subvolume tree for file
extent items to delete, it drops all the extent maps in the range
from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
3) Task B starts doing a fiemap against our inode. When looking up for
the inode's extent maps in the range from 128KiB to (u64)-1, it
doesn't find any in the inode's extent map tree, since they were
removed by task A. Because it didn't find any in the extent map
tree, it scans the inode's subvolume tree for file extent items, and
it finds the 1MiB prealloc extent at file offset 128KiB, then it
creates an extent map based on that file extent item and adds it to
inode's extent map tree (this ends up being done by
btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
get_extent_skip_holes());
4) Task A then drops the prealloc extent at file offset 128KiB and
shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
truncation operation finishes and we end up with an extent map
representing a 1MiB prealloc extent at file offset 128KiB, despite we
don't have any more that extent;
After this the two types of problems we have are:
1) Future calls to fiemap always report that a 1MiB prealloc extent
exists at file offset 128KiB. This is stale information, no longer
correct;
2) If the size of the file is increased, by a truncate operation that
increases the file size or by a write into a file offset > 64KiB for
example, we end up not inserting file extent items to represent holes
for any range between 128KiB and 128KiB + 1MiB, since the hole
expansion function, btrfs_cont_expand() will skip hole insertion for
any range for which an extent map exists that represents a prealloc
extent. This causes fsck to complain about missing file extent items
when not using the NO_HOLES feature.
The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.
Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5b3ec93ff911..7d26b4bfb2c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4085,6 +4085,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
u64 bytes_deleted = 0;
bool be_nice = false;
bool should_throttle = false;
+ const u64 lock_start = ALIGN_DOWN(new_size, fs_info->sectorsize);
+ struct extent_state *cached_state = NULL;
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
@@ -4101,6 +4103,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
return -ENOMEM;
path->reada = READA_BACK;
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
+
/*
* We want to drop from the next block forward in case this new size is
* not block aligned since we will be keeping the last block of the
@@ -4367,6 +4372,9 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
btrfs_ordered_update_i_size(inode, last_size, NULL);
}
+ unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
+ &cached_state);
+
btrfs_free_path(path);
return ret;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 1e90315149f3fe148e114a5de86f0196d1c21fa5 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Thu, 13 Feb 2020 10:47:29 -0500
Subject: [PATCH] btrfs: do not check delayed items are empty for single
transaction cleanup
btrfs_assert_delayed_root_empty() will check if the delayed root is
completely empty, but this is a filesystem-wide check. On cleanup we
may have allowed other transactions to begin, for whatever reason, and
thus the delayed root is not empty.
So remove this check from cleanup_one_transation(). This however can
stay in btrfs_cleanup_transaction(), because it checks only after all of
the transactions have been properly cleaned up, and thus is valid.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 197352f23534..c6c9a6a8e6c8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4503,7 +4503,6 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
wake_up(&fs_info->transaction_wait);
btrfs_destroy_delayed_inodes(fs_info);
- btrfs_assert_delayed_root_empty(fs_info);
btrfs_destroy_marked_extents(fs_info, &cur_trans->dirty_pages,
EXTENT_DIRTY);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 1e90315149f3fe148e114a5de86f0196d1c21fa5 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Thu, 13 Feb 2020 10:47:29 -0500
Subject: [PATCH] btrfs: do not check delayed items are empty for single
transaction cleanup
btrfs_assert_delayed_root_empty() will check if the delayed root is
completely empty, but this is a filesystem-wide check. On cleanup we
may have allowed other transactions to begin, for whatever reason, and
thus the delayed root is not empty.
So remove this check from cleanup_one_transation(). This however can
stay in btrfs_cleanup_transaction(), because it checks only after all of
the transactions have been properly cleaned up, and thus is valid.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 197352f23534..c6c9a6a8e6c8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4503,7 +4503,6 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
wake_up(&fs_info->transaction_wait);
btrfs_destroy_delayed_inodes(fs_info);
- btrfs_assert_delayed_root_empty(fs_info);
btrfs_destroy_marked_extents(fs_info, &cur_trans->dirty_pages,
EXTENT_DIRTY);
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 315bf8ef914f31d51d084af950703aa1e09a728c Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Thu, 13 Feb 2020 10:47:28 -0500
Subject: [PATCH] btrfs: reset fs_root to NULL on error in open_ctree
While running my error injection script I hit a panic when we tried to
clean up the fs_root when freeing the fs_root. This is because
fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by
setting fs_info->fs_root = NULL; if we fail to read the root.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d7fec89974cb..197352f23534 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3200,6 +3200,7 @@ int __cold open_ctree(struct super_block *sb,
if (IS_ERR(fs_info->fs_root)) {
err = PTR_ERR(fs_info->fs_root);
btrfs_warn(fs_info, "failed to read fs tree: %d", err);
+ fs_info->fs_root = NULL;
goto fail_qgroup;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 315bf8ef914f31d51d084af950703aa1e09a728c Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Thu, 13 Feb 2020 10:47:28 -0500
Subject: [PATCH] btrfs: reset fs_root to NULL on error in open_ctree
While running my error injection script I hit a panic when we tried to
clean up the fs_root when freeing the fs_root. This is because
fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by
setting fs_info->fs_root = NULL; if we fail to read the root.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d7fec89974cb..197352f23534 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3200,6 +3200,7 @@ int __cold open_ctree(struct super_block *sb,
if (IS_ERR(fs_info->fs_root)) {
err = PTR_ERR(fs_info->fs_root);
btrfs_warn(fs_info, "failed to read fs tree: %d", err);
+ fs_info->fs_root = NULL;
goto fail_qgroup;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 315bf8ef914f31d51d084af950703aa1e09a728c Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Thu, 13 Feb 2020 10:47:28 -0500
Subject: [PATCH] btrfs: reset fs_root to NULL on error in open_ctree
While running my error injection script I hit a panic when we tried to
clean up the fs_root when freeing the fs_root. This is because
fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by
setting fs_info->fs_root = NULL; if we fail to read the root.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov(a)suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d7fec89974cb..197352f23534 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3200,6 +3200,7 @@ int __cold open_ctree(struct super_block *sb,
if (IS_ERR(fs_info->fs_root)) {
err = PTR_ERR(fs_info->fs_root);
btrfs_warn(fs_info, "failed to read fs tree: %d", err);
+ fs_info->fs_root = NULL;
goto fail_qgroup;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b778cf962d71a0e737923d55d0432f3bd287258e Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Thu, 13 Feb 2020 10:47:31 -0500
Subject: [PATCH] btrfs: fix bytes_may_use underflow in prealloc error
condtition
I hit the following warning while running my error injection stress
testing:
WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
Call Trace:
btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
__btrfs_prealloc_file_range+0x378/0x470 [btrfs]
elfcorehdr_read+0x40/0x40
? elfcorehdr_read+0x40/0x40
? btrfs_commit_transaction+0xca/0xa50 [btrfs]
? dput+0xb4/0x2a0
? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
? btrfs_sync_file+0x30e/0x420 [btrfs]
? do_fsync+0x38/0x70
? __x64_sys_fdatasync+0x13/0x20
? do_syscall_64+0x5b/0x1b0
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens if we fail to insert our reserved file extent. At this
point we've already converted our reservation from ->bytes_may_use to
->bytes_reserved. However once we break we will attempt to free
everything from [cur_offset, end] from ->bytes_may_use, but our extent
reservation will overlap part of this.
Fix this problem by adding ins.offset (our extent allocation size) to
cur_offset so we remove the actual remaining part from ->bytes_may_use.
I validated this fix using my inject-error.py script
python inject-error.py -o should_fail_bio -t cache_save_setup -t \
__btrfs_prealloc_file_range \
-t insert_reserved_file_extent.constprop.0 \
-r "-5" ./run-fsstress.sh
where run-fsstress.sh simply mounts and runs fsstress on a disk.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 36deef69f847..4f47ba652b31 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9824,6 +9824,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_key ins;
u64 cur_offset = start;
+ u64 clear_offset = start;
u64 i_size;
u64 cur_bytes;
u64 last_alloc = (u64)-1;
@@ -9858,6 +9859,15 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
btrfs_end_transaction(trans);
break;
}
+
+ /*
+ * We've reserved this space, and thus converted it from
+ * ->bytes_may_use to ->bytes_reserved. Any error that happens
+ * from here on out we will only need to clear our reservation
+ * for the remaining unreserved area, so advance our
+ * clear_offset by our extent size.
+ */
+ clear_offset += ins.offset;
btrfs_dec_block_group_reservations(fs_info, ins.objectid);
last_alloc = ins.offset;
@@ -9937,9 +9947,9 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
if (own_trans)
btrfs_end_transaction(trans);
}
- if (cur_offset < end)
- btrfs_free_reserved_data_space(inode, NULL, cur_offset,
- end - cur_offset + 1);
+ if (clear_offset < end)
+ btrfs_free_reserved_data_space(inode, NULL, clear_offset,
+ end - clear_offset + 1);
return ret;
}
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From b778cf962d71a0e737923d55d0432f3bd287258e Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef(a)toxicpanda.com>
Date: Thu, 13 Feb 2020 10:47:31 -0500
Subject: [PATCH] btrfs: fix bytes_may_use underflow in prealloc error
condtition
I hit the following warning while running my error injection stress
testing:
WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
Call Trace:
btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
__btrfs_prealloc_file_range+0x378/0x470 [btrfs]
elfcorehdr_read+0x40/0x40
? elfcorehdr_read+0x40/0x40
? btrfs_commit_transaction+0xca/0xa50 [btrfs]
? dput+0xb4/0x2a0
? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
? btrfs_sync_file+0x30e/0x420 [btrfs]
? do_fsync+0x38/0x70
? __x64_sys_fdatasync+0x13/0x20
? do_syscall_64+0x5b/0x1b0
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens if we fail to insert our reserved file extent. At this
point we've already converted our reservation from ->bytes_may_use to
->bytes_reserved. However once we break we will attempt to free
everything from [cur_offset, end] from ->bytes_may_use, but our extent
reservation will overlap part of this.
Fix this problem by adding ins.offset (our extent allocation size) to
cur_offset so we remove the actual remaining part from ->bytes_may_use.
I validated this fix using my inject-error.py script
python inject-error.py -o should_fail_bio -t cache_save_setup -t \
__btrfs_prealloc_file_range \
-t insert_reserved_file_extent.constprop.0 \
-r "-5" ./run-fsstress.sh
where run-fsstress.sh simply mounts and runs fsstress on a disk.
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu(a)suse.com>
Signed-off-by: Josef Bacik <josef(a)toxicpanda.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 36deef69f847..4f47ba652b31 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9824,6 +9824,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_key ins;
u64 cur_offset = start;
+ u64 clear_offset = start;
u64 i_size;
u64 cur_bytes;
u64 last_alloc = (u64)-1;
@@ -9858,6 +9859,15 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
btrfs_end_transaction(trans);
break;
}
+
+ /*
+ * We've reserved this space, and thus converted it from
+ * ->bytes_may_use to ->bytes_reserved. Any error that happens
+ * from here on out we will only need to clear our reservation
+ * for the remaining unreserved area, so advance our
+ * clear_offset by our extent size.
+ */
+ clear_offset += ins.offset;
btrfs_dec_block_group_reservations(fs_info, ins.objectid);
last_alloc = ins.offset;
@@ -9937,9 +9947,9 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
if (own_trans)
btrfs_end_transaction(trans);
}
- if (cur_offset < end)
- btrfs_free_reserved_data_space(inode, NULL, cur_offset,
- end - cur_offset + 1);
+ if (clear_offset < end)
+ btrfs_free_reserved_data_space(inode, NULL, clear_offset,
+ end - clear_offset + 1);
return ret;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 81f7eb00ff5bb8326e82503a32809421d14abb8a Mon Sep 17 00:00:00 2001
From: Jeff Mahoney <jeffm(a)suse.com>
Date: Tue, 11 Feb 2020 15:25:37 +0800
Subject: [PATCH] btrfs: destroy qgroup extent records on transaction abort
We clean up the delayed references when we abort a transaction but we
leave the pending qgroup extent records behind, leaking memory.
This patch destroys the extent records when we destroy the delayed refs
and makes sure ensure they're gone before releasing the transaction.
Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Jeff Mahoney <jeffm(a)suse.com>
[ Rebased to latest upstream, remove to_qgroup() helper, use
rbtree_postorder_for_each_entry_safe() wrapper ]
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 89422aa8e9d1..d7fec89974cb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4276,6 +4276,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
cond_resched();
spin_lock(&delayed_refs->lock);
}
+ btrfs_qgroup_destroy_extent_records(trans);
spin_unlock(&delayed_refs->lock);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 98d9a50352d6..ff1870ff3474 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -4002,3 +4002,16 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
}
return ret;
}
+
+void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans)
+{
+ struct btrfs_qgroup_extent_record *entry;
+ struct btrfs_qgroup_extent_record *next;
+ struct rb_root *root;
+
+ root = &trans->delayed_refs.dirty_extent_root;
+ rbtree_postorder_for_each_entry_safe(entry, next, root, node) {
+ ulist_free(entry->old_roots);
+ kfree(entry);
+ }
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 236f12224d52..1bc654459469 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -414,5 +414,6 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
u64 last_snapshot);
int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root, struct extent_buffer *eb);
+void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
#endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 33dcc88b428a..beb6c69cd1e5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -121,6 +121,8 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
BUG_ON(!list_empty(&transaction->list));
WARN_ON(!RB_EMPTY_ROOT(
&transaction->delayed_refs.href_root.rb_root));
+ WARN_ON(!RB_EMPTY_ROOT(
+ &transaction->delayed_refs.dirty_extent_root));
if (transaction->delayed_refs.pending_csums)
btrfs_err(transaction->fs_info,
"pending csums is %llu",
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 81f7eb00ff5bb8326e82503a32809421d14abb8a Mon Sep 17 00:00:00 2001
From: Jeff Mahoney <jeffm(a)suse.com>
Date: Tue, 11 Feb 2020 15:25:37 +0800
Subject: [PATCH] btrfs: destroy qgroup extent records on transaction abort
We clean up the delayed references when we abort a transaction but we
leave the pending qgroup extent records behind, leaking memory.
This patch destroys the extent records when we destroy the delayed refs
and makes sure ensure they're gone before releasing the transaction.
Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Jeff Mahoney <jeffm(a)suse.com>
[ Rebased to latest upstream, remove to_qgroup() helper, use
rbtree_postorder_for_each_entry_safe() wrapper ]
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 89422aa8e9d1..d7fec89974cb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4276,6 +4276,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
cond_resched();
spin_lock(&delayed_refs->lock);
}
+ btrfs_qgroup_destroy_extent_records(trans);
spin_unlock(&delayed_refs->lock);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 98d9a50352d6..ff1870ff3474 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -4002,3 +4002,16 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
}
return ret;
}
+
+void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans)
+{
+ struct btrfs_qgroup_extent_record *entry;
+ struct btrfs_qgroup_extent_record *next;
+ struct rb_root *root;
+
+ root = &trans->delayed_refs.dirty_extent_root;
+ rbtree_postorder_for_each_entry_safe(entry, next, root, node) {
+ ulist_free(entry->old_roots);
+ kfree(entry);
+ }
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 236f12224d52..1bc654459469 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -414,5 +414,6 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
u64 last_snapshot);
int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root, struct extent_buffer *eb);
+void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
#endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 33dcc88b428a..beb6c69cd1e5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -121,6 +121,8 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
BUG_ON(!list_empty(&transaction->list));
WARN_ON(!RB_EMPTY_ROOT(
&transaction->delayed_refs.href_root.rb_root));
+ WARN_ON(!RB_EMPTY_ROOT(
+ &transaction->delayed_refs.dirty_extent_root));
if (transaction->delayed_refs.pending_csums)
btrfs_err(transaction->fs_info,
"pending csums is %llu",
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 81f7eb00ff5bb8326e82503a32809421d14abb8a Mon Sep 17 00:00:00 2001
From: Jeff Mahoney <jeffm(a)suse.com>
Date: Tue, 11 Feb 2020 15:25:37 +0800
Subject: [PATCH] btrfs: destroy qgroup extent records on transaction abort
We clean up the delayed references when we abort a transaction but we
leave the pending qgroup extent records behind, leaking memory.
This patch destroys the extent records when we destroy the delayed refs
and makes sure ensure they're gone before releasing the transaction.
Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Jeff Mahoney <jeffm(a)suse.com>
[ Rebased to latest upstream, remove to_qgroup() helper, use
rbtree_postorder_for_each_entry_safe() wrapper ]
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 89422aa8e9d1..d7fec89974cb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4276,6 +4276,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
cond_resched();
spin_lock(&delayed_refs->lock);
}
+ btrfs_qgroup_destroy_extent_records(trans);
spin_unlock(&delayed_refs->lock);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 98d9a50352d6..ff1870ff3474 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -4002,3 +4002,16 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
}
return ret;
}
+
+void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans)
+{
+ struct btrfs_qgroup_extent_record *entry;
+ struct btrfs_qgroup_extent_record *next;
+ struct rb_root *root;
+
+ root = &trans->delayed_refs.dirty_extent_root;
+ rbtree_postorder_for_each_entry_safe(entry, next, root, node) {
+ ulist_free(entry->old_roots);
+ kfree(entry);
+ }
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 236f12224d52..1bc654459469 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -414,5 +414,6 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
u64 last_snapshot);
int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root, struct extent_buffer *eb);
+void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
#endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 33dcc88b428a..beb6c69cd1e5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -121,6 +121,8 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
BUG_ON(!list_empty(&transaction->list));
WARN_ON(!RB_EMPTY_ROOT(
&transaction->delayed_refs.href_root.rb_root));
+ WARN_ON(!RB_EMPTY_ROOT(
+ &transaction->delayed_refs.dirty_extent_root));
if (transaction->delayed_refs.pending_csums)
btrfs_err(transaction->fs_info,
"pending csums is %llu",
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 81f7eb00ff5bb8326e82503a32809421d14abb8a Mon Sep 17 00:00:00 2001
From: Jeff Mahoney <jeffm(a)suse.com>
Date: Tue, 11 Feb 2020 15:25:37 +0800
Subject: [PATCH] btrfs: destroy qgroup extent records on transaction abort
We clean up the delayed references when we abort a transaction but we
leave the pending qgroup extent records behind, leaking memory.
This patch destroys the extent records when we destroy the delayed refs
and makes sure ensure they're gone before releasing the transaction.
Fixes: 3368d001ba5d ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
CC: stable(a)vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef(a)toxicpanda.com>
Signed-off-by: Jeff Mahoney <jeffm(a)suse.com>
[ Rebased to latest upstream, remove to_qgroup() helper, use
rbtree_postorder_for_each_entry_safe() wrapper ]
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 89422aa8e9d1..d7fec89974cb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4276,6 +4276,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
cond_resched();
spin_lock(&delayed_refs->lock);
}
+ btrfs_qgroup_destroy_extent_records(trans);
spin_unlock(&delayed_refs->lock);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 98d9a50352d6..ff1870ff3474 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -4002,3 +4002,16 @@ int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
}
return ret;
}
+
+void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans)
+{
+ struct btrfs_qgroup_extent_record *entry;
+ struct btrfs_qgroup_extent_record *next;
+ struct rb_root *root;
+
+ root = &trans->delayed_refs.dirty_extent_root;
+ rbtree_postorder_for_each_entry_safe(entry, next, root, node) {
+ ulist_free(entry->old_roots);
+ kfree(entry);
+ }
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 236f12224d52..1bc654459469 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -414,5 +414,6 @@ int btrfs_qgroup_add_swapped_blocks(struct btrfs_trans_handle *trans,
u64 last_snapshot);
int btrfs_qgroup_trace_subtree_after_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root, struct extent_buffer *eb);
+void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans);
#endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 33dcc88b428a..beb6c69cd1e5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -121,6 +121,8 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
BUG_ON(!list_empty(&transaction->list));
WARN_ON(!RB_EMPTY_ROOT(
&transaction->delayed_refs.href_root.rb_root));
+ WARN_ON(!RB_EMPTY_ROOT(
+ &transaction->delayed_refs.dirty_extent_root));
if (transaction->delayed_refs.pending_csums)
btrfs_err(transaction->fs_info,
"pending csums is %llu",
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From 91a5f413af596ad01097e59bf487eb07cb3f1331 Mon Sep 17 00:00:00 2001
From: Vitaly Kuznetsov <vkuznets(a)redhat.com>
Date: Thu, 20 Feb 2020 18:22:05 +0100
Subject: [PATCH] KVM: nVMX: handle nested posted interrupts when apicv is
disabled for L1
Even when APICv is disabled for L1 it can (and, actually, is) still
available for L2, this means we need to always call
vmx_deliver_nested_posted_interrupt() when attempting an interrupt
delivery.
Suggested-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets(a)redhat.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 40a0c0fd95ca..a84e8c5acda8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1146,7 +1146,7 @@ struct kvm_x86_ops {
void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
void (*set_virtual_apic_mode)(struct kvm_vcpu *vcpu);
void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa);
- void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
+ int (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*set_identity_map_addr)(struct kvm *kvm, u64 ident_addr);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index afcd30d44cbb..cc8ee8125712 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1046,11 +1046,8 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
apic->regs + APIC_TMR);
}
- if (vcpu->arch.apicv_active)
- kvm_x86_ops->deliver_posted_interrupt(vcpu, vector);
- else {
+ if (kvm_x86_ops->deliver_posted_interrupt(vcpu, vector)) {
kvm_lapic_set_irr(vector, apic);
-
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_vcpu_kick(vcpu);
}
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index a391b29138f0..8787a123b8e7 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5258,8 +5258,11 @@ static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
return;
}
-static void svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
+static int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
{
+ if (!vcpu->arch.apicv_active)
+ return -1;
+
kvm_lapic_set_irr(vec, vcpu->arch.apic);
smp_mb__after_atomic();
@@ -5271,6 +5274,8 @@ static void svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
put_cpu();
} else
kvm_vcpu_wake_up(vcpu);
+
+ return 0;
}
static bool svm_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index dafe4df893c8..63ccc435a602 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3822,24 +3822,29 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
* 2. If target vcpu isn't running(root mode), kick it to pick up the
* interrupt from PIR in next vmentry.
*/
-static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
+static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int r;
r = vmx_deliver_nested_posted_interrupt(vcpu, vector);
if (!r)
- return;
+ return 0;
+
+ if (!vcpu->arch.apicv_active)
+ return -1;
if (pi_test_and_set_pir(vector, &vmx->pi_desc))
- return;
+ return 0;
/* If a previous notification has sent the IPI, nothing to do. */
if (pi_test_and_set_on(&vmx->pi_desc))
- return;
+ return 0;
if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
kvm_vcpu_kick(vcpu);
+
+ return 0;
}
/*