This is the start of the stable review cycle for the 5.10.136 release.
There are 23 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 11 Aug 2022 17:55:02 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.136-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.10.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.10.136-rc1
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/speculation: Add LFENCE to RSB fill sequence
Daniel Sneddon <daniel.sneddon(a)linux.intel.com>
x86/speculation: Add RSB VM Exit protections
Ning Qiang <sohu0106(a)126.com>
macintosh/adb: fix oob read in do_adb_query() function
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x13D3:0x3586
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x13D3:0x3587
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x0CB8:0xC558
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x04C5:0x1675
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x04CA:0x4007
Aaron Ma <aaron.ma(a)canonical.com>
Bluetooth: btusb: Add support of IMC Networks PID 0x3568
Hakan Jansson <hakan.jansson(a)infineon.com>
Bluetooth: hci_bcm: Add DT compatible for CYW55572
Ahmad Fatoum <a.fatoum(a)pengutronix.de>
Bluetooth: hci_bcm: Add BCM4349B1 variant
Raghavendra Rao Ananta <rananta(a)google.com>
selftests: KVM: Handle compiler optimizations in ucall
Dmitry Klochkov <kdmitry556(a)gmail.com>
tools/kvm_stat: fix display of error when multiple processes are found
GUO Zihua <guozihua(a)huawei.com>
crypto: arm64/poly1305 - fix a read out-of-bound
Tony Luck <tony.luck(a)intel.com>
ACPI: APEI: Better fix to avoid spamming the console with old error logs
Werner Sembach <wse(a)tuxedocomputers.com>
ACPI: video: Shortening quirk list by identifying Clevo by board_name only
Werner Sembach <wse(a)tuxedocomputers.com>
ACPI: video: Force backlight native for some TongFang devices
George Kennedy <george.kennedy(a)oracle.com>
tun: avoid double free in tun_free_netdev
Jakub Sitnicki <jakub(a)cloudflare.com>
selftests/bpf: Check dst_port only on the client socket
Jakub Sitnicki <jakub(a)cloudflare.com>
selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads
Tetsuo Handa <penguin-kernel(a)I-love.SAKURA.ne.jp>
ath9k_htc: fix NULL pointer dereference at ath9k_htc_tx_get_packet()
Tetsuo Handa <penguin-kernel(a)I-love.SAKURA.ne.jp>
ath9k_htc: fix NULL pointer dereference at ath9k_htc_rxep()
Ben Hutchings <ben(a)decadent.org.uk>
x86/speculation: Make all RETbleed mitigations 64-bit only
-------------
Diffstat:
Documentation/admin-guide/hw-vuln/spectre.rst | 8 ++
Makefile | 4 +-
arch/arm64/crypto/poly1305-glue.c | 2 +-
arch/x86/Kconfig | 8 +-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/include/asm/nospec-branch.h | 21 +++-
arch/x86/kernel/cpu/bugs.c | 86 +++++++++++-----
arch/x86/kernel/cpu/common.c | 12 ++-
arch/x86/kvm/vmx/vmenter.S | 8 +-
drivers/acpi/apei/bert.c | 31 ++++--
drivers/acpi/video_detect.c | 55 ++++++----
drivers/bluetooth/btbcm.c | 2 +
drivers/bluetooth/btusb.c | 15 +++
drivers/bluetooth/hci_bcm.c | 2 +
drivers/macintosh/adb.c | 2 +-
drivers/net/tun.c | 114 +++++++++++----------
drivers/net/wireless/ath/ath9k/htc.h | 2 +
drivers/net/wireless/ath/ath9k/htc_drv_txrx.c | 13 +++
drivers/net/wireless/ath/ath9k/wmi.c | 4 +
tools/arch/x86/include/asm/cpufeatures.h | 1 +
tools/arch/x86/include/asm/msr-index.h | 4 +
tools/include/uapi/linux/bpf.h | 3 +-
tools/kvm/kvm_stat/kvm_stat | 3 +-
.../testing/selftests/bpf/prog_tests/sock_fields.c | 60 +++++++----
.../testing/selftests/bpf/progs/test_sock_fields.c | 45 ++++++++
tools/testing/selftests/bpf/verifier/sock.c | 81 ++++++++++++++-
tools/testing/selftests/kvm/lib/aarch64/ucall.c | 9 +-
28 files changed, 451 insertions(+), 150 deletions(-)
This is the start of the stable review cycle for the 5.18.17 release.
There are 35 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Thu, 11 Aug 2022 17:55:02 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.18.17-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.18.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.18.17-rc1
Pawan Gupta <pawan.kumar.gupta(a)linux.intel.com>
x86/speculation: Add LFENCE to RSB fill sequence
Daniel Sneddon <daniel.sneddon(a)linux.intel.com>
x86/speculation: Add RSB VM Exit protections
Ning Qiang <sohu0106(a)126.com>
macintosh/adb: fix oob read in do_adb_query() function
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x13D3:0x3586
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x13D3:0x3587
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x0CB8:0xC558
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x04C5:0x1675
Hilda Wu <hildawu(a)realtek.com>
Bluetooth: btusb: Add Realtek RTL8852C support ID 0x04CA:0x4007
Aaron Ma <aaron.ma(a)canonical.com>
Bluetooth: btusb: Add support of IMC Networks PID 0x3568
Ahmad Fatoum <a.fatoum(a)pengutronix.de>
dt-bindings: bluetooth: broadcom: Add BCM4349B1 DT binding
Hakan Jansson <hakan.jansson(a)infineon.com>
Bluetooth: hci_bcm: Add DT compatible for CYW55572
Ahmad Fatoum <a.fatoum(a)pengutronix.de>
Bluetooth: hci_bcm: Add BCM4349B1 variant
Sai Teja Aluvala <quic_saluvala(a)quicinc.com>
Bluetooth: hci_qca: Return wakeup for qca_wakeup
Naohiro Aota <naohiro.aota(a)wdc.com>
btrfs: zoned: drop optimization of zone finish
Naohiro Aota <naohiro.aota(a)wdc.com>
btrfs: zoned: fix critical section of relocation inode writeback
Naohiro Aota <naohiro.aota(a)wdc.com>
btrfs: zoned: prevent allocation from previous data relocation BG
Peter Collingbourne <pcc(a)google.com>
arm64: set UXN on swapper page tables
Mingwei Zhang <mizhang(a)google.com>
KVM: x86/svm: add __GFP_ACCOUNT to __sev_dbg_{en,de}crypt_user()
Raghavendra Rao Ananta <rananta(a)google.com>
selftests: KVM: Handle compiler optimizations in ucall
Dmitry Klochkov <kdmitry556(a)gmail.com>
tools/kvm_stat: fix display of error when multiple processes are found
David Matlack <dmatlack(a)google.com>
KVM: selftests: Restrict test region to 48-bit physical addresses when using nested
Maxim Levitsky <mlevitsk(a)redhat.com>
KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking
Maxim Levitsky <mlevitsk(a)redhat.com>
KVM: x86: disable preemption while updating apicv inhibition
Seth Forshee <sforshee(a)digitalocean.com>
entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
Ben Gardon <bgardon(a)google.com>
KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
Vitaly Kuznetsov <vkuznets(a)redhat.com>
KVM: selftests: Make hyperv_clock selftest more stable
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: x86: do not set st->preempted when going back to user space
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: x86: do not report a vCPU as preempted outside instruction boundaries
GUO Zihua <guozihua(a)huawei.com>
crypto: arm64/poly1305 - fix a read out-of-bound
Tony Luck <tony.luck(a)intel.com>
ACPI: APEI: Better fix to avoid spamming the console with old error logs
Werner Sembach <wse(a)tuxedocomputers.com>
ACPI: video: Shortening quirk list by identifying Clevo by board_name only
Werner Sembach <wse(a)tuxedocomputers.com>
ACPI: video: Force backlight native for some TongFang devices
Stéphane Graber <stgraber(a)ubuntu.com>
tools/vm/slabinfo: Handle files in debugfs
Jan Kara <jack(a)suse.cz>
block: fix default IO priority handling again
Ben Hutchings <ben(a)decadent.org.uk>
x86/speculation: Make all RETbleed mitigations 64-bit only
-------------
Diffstat:
Documentation/admin-guide/hw-vuln/spectre.rst | 8 ++
.../bindings/net/broadcom-bluetooth.yaml | 1 +
Makefile | 4 +-
arch/arm64/crypto/poly1305-glue.c | 2 +-
arch/arm64/include/asm/kernel-pgtable.h | 4 +-
arch/arm64/kernel/head.S | 2 +-
arch/x86/Kconfig | 8 +-
arch/x86/include/asm/cpufeatures.h | 2 +
arch/x86/include/asm/kvm_host.h | 3 +
arch/x86/include/asm/msr-index.h | 4 +
arch/x86/include/asm/nospec-branch.h | 21 +++++-
arch/x86/kernel/cpu/bugs.c | 86 ++++++++++++++++------
arch/x86/kernel/cpu/common.c | 12 ++-
arch/x86/kvm/mmu/tdp_iter.c | 9 +++
arch/x86/kvm/mmu/tdp_iter.h | 1 +
arch/x86/kvm/mmu/tdp_mmu.c | 38 ++++++++--
arch/x86/kvm/svm/sev.c | 4 +-
arch/x86/kvm/svm/svm.c | 2 +
arch/x86/kvm/vmx/vmenter.S | 8 +-
arch/x86/kvm/vmx/vmx.c | 1 +
arch/x86/kvm/x86.c | 50 ++++++++++---
arch/x86/kvm/xen.h | 6 +-
block/blk-ioc.c | 2 +
block/ioprio.c | 4 +-
drivers/acpi/apei/bert.c | 31 ++++++--
drivers/acpi/video_detect.c | 55 +++++++++-----
drivers/bluetooth/btbcm.c | 2 +
drivers/bluetooth/btusb.c | 15 ++++
drivers/bluetooth/hci_bcm.c | 2 +
drivers/bluetooth/hci_qca.c | 2 +-
drivers/macintosh/adb.c | 2 +-
fs/btrfs/block-group.h | 1 +
fs/btrfs/extent-tree.c | 20 ++++-
fs/btrfs/extent_io.c | 3 +-
fs/btrfs/inode.c | 2 +
fs/btrfs/zoned.c | 50 +++++++++++--
fs/btrfs/zoned.h | 5 ++
include/linux/ioprio.h | 2 +-
kernel/entry/kvm.c | 6 --
tools/arch/x86/include/asm/cpufeatures.h | 1 +
tools/arch/x86/include/asm/msr-index.h | 4 +
tools/kvm/kvm_stat/kvm_stat | 3 +-
tools/testing/selftests/kvm/lib/aarch64/ucall.c | 9 +--
tools/testing/selftests/kvm/lib/perf_test_util.c | 18 ++++-
tools/testing/selftests/kvm/x86_64/hyperv_clock.c | 10 ++-
tools/vm/slabinfo.c | 26 ++++++-
virt/kvm/kvm_main.c | 8 +-
47 files changed, 434 insertions(+), 125 deletions(-)
Hi Greg,
This backport series contains small fixes from the v5.15 release.
From this point on, 5.10.y xfs can follow and pick changes
posted to 5.15.y.
I have some debt of fixes from v5.17 that are already applied to
5.15.y but not yet submitted to 5.10.y; those will be included
in my next batch.
Thanks,
Amir.
Changes from [v1]:
- Drop backport that disallows disabling of quota accounting
on a mounted xfs (Darrick)
- Added Acked-by Darrick
- CC stable
[v1] https://lore.kernel.org/linux-xfs/20220809111708.92768-1-amir73il@gmail.com/
Darrick J. Wong (1):
xfs: only set IOMAP_F_SHARED when providing a srcmap to a write
Dave Chinner (2):
mm: Add kvrealloc()
xfs: fix I_DONTCACHE
fs/xfs/xfs_icache.c | 3 ++-
fs/xfs/xfs_iomap.c | 8 ++++----
fs/xfs/xfs_iops.c | 2 +-
fs/xfs/xfs_log_recover.c | 4 +++-
include/linux/mm.h | 2 ++
mm/util.c | 15 +++++++++++++++
6 files changed, 27 insertions(+), 7 deletions(-)
--
2.25.1
Staring at hugetlb_wp(), one might wonder where all the logic for shared
mappings is when stumbling over a write-protected page in a shared
mapping. In fact, there is none, and so far we thought we could get
away with that because e.g., mprotect() should always do the right thing
and map all pages directly writable.
Looks like we were wrong:
--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>

#define HUGETLB_SIZE (2 * 1024 * 1024u)

/* Writing "4" to clear_refs clears the soft-dirty bits of this process. */
static void clear_softdirty(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);
	const char *ctrl = "4";
	int ret;

	if (fd < 0) {
		fprintf(stderr, "open(clear_refs) failed\n");
		exit(1);
	}
	ret = write(fd, ctrl, strlen(ctrl));
	if (ret != strlen(ctrl)) {
		fprintf(stderr, "write(clear_refs) failed\n");
		exit(1);
	}
	close(fd);
}

int main(int argc, char **argv)
{
	char *map;
	int fd;

	/* O_CREAT requires a mode argument; open() returns -1 on failure. */
	fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		fprintf(stderr, "open() failed\n");
		return -errno;
	}
	if (ftruncate(fd, HUGETLB_SIZE)) {
		fprintf(stderr, "ftruncate() failed\n");
		return -errno;
	}

	map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		return -errno;
	}

	/* Populate the page with a first write. */
	*map = 0;

	if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
		fprintf(stderr, "mprotect() failed\n");
		return -errno;
	}

	clear_softdirty();

	if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
		fprintf(stderr, "mprotect() failed\n");
		return -errno;
	}

	/* This write hits a write-protected page in a shared mapping. */
	*map = 0;

	return 0;
}
--------------------------------------------------------------------------
The above test fails with SIGBUS when there is only a single free hugetlb page:
# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
Bus error (core dumped)
And worse, with sufficient free hugetlb pages it will map an anonymous page
into a shared mapping, for example, messing up accounting during unmap
and breaking MAP_SHARED semantics:
# echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
# cat /proc/meminfo | grep HugePages_
HugePages_Total: 2
HugePages_Free: 1
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
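The HugePages_Rsvd value shown above equals 2^64 - 1, which is what an
unsigned 64-bit counter holds after being decremented once below zero.
The following is only a user-space sketch of that arithmetic; the helper
names are hypothetical and this is not the kernel's accounting code.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a reservation counter: just an unsigned
 * 64-bit value that is decremented without an underflow check.
 */
static uint64_t release_reservation(uint64_t rsvd)
{
	/* Unsigned arithmetic wraps modulo 2^64, so 0 - 1 becomes 2^64 - 1. */
	return rsvd - 1;
}

/* Render the counter as an unsigned decimal, similar to the meminfo line. */
static int format_rsvd(char *buf, size_t len, uint64_t rsvd)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "HugePages_Rsvd: %" PRIu64, rsvd);
}
```

Calling format_rsvd() on release_reservation(0) produces the same
20-digit number as in the output above, which suggests that one
reservation too many was released.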
The reason in this particular case is that vma_wants_writenotify() will
return "true", removing VM_SHARED in vma_set_page_prot() to map pages
write-protected. Let's teach vma_wants_writenotify() that hugetlb does not
support write-notify, including softdirty tracking.
Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
Cc: <stable(a)vger.kernel.org> # v3.18+
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
mm/mmap.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/mmap.c b/mm/mmap.c
index 61e6135c54ef..462a6b0344ac 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1683,6 +1683,13 @@ int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
 	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
 		return 0;
 
+	/*
+	 * Hugetlb does not require/support writenotify; especially, it does not
+	 * support softdirty tracking.
+	 */
+	if (is_vm_hugetlb_page(vma))
+		return 0;
+
 	/* The backer wishes to know when pages are first written to? */
 	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
 		return 1;
--
2.35.3
Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
that FOLL_FORCE can be dangerous, especially if there are races that can
be exploited by user space.
Right now, it would be sufficient to have some code that sets a PTE of
a R/O-mapped shared page dirty, in order for it to erroneously become
writable by FOLL_FORCE. The implications of setting a write-protected PTE
dirty might not be immediately obvious to everyone.
And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set
pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
a shmem page R/O while marking the pte dirty. This can be used by
unprivileged user space to modify tmpfs/shmem file content even if the user
does not have write permissions to the file, and to bypass memfd write
sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
To fix such security issues for good, the insight is that we really only
need that fancy retry logic (FOLL_COW) for COW mappings that are not
writable (!VM_WRITE). And in a COW mapping, we really only broke COW if
we have an exclusive anonymous page mapped. If we have something else
mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
we have to trigger a write fault to break COW. If we don't find an
exclusive anonymous page when we retry, we have to trigger COW breaking
once again because something intervened.
Let's move away from this mandatory-retry + dirty handling and rely on
our PageAnonExclusive() flag for making a similar decision, to use the
same COW logic as in other kernel parts here as well. In case we stumble
over a PTE in a COW mapping that does not map an exclusive anonymous page,
COW was not properly broken and we have to trigger a fake write-fault to
break COW.
Just like we do in can_change_pte_writable() added via
commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive
anonymous pages when changing protection") and commit 76aefad628aa
("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take
care of softdirty and uffd-wp manually.
For example, a write() via /proc/self/mem to a uffd-wp-protected range has
to fail instead of silently granting write access and bypassing the
userspace fault handler. Note that FOLL_FORCE is not only used for debug
access, but also triggered by applications without debug intentions, for
example, when pinning pages via RDMA.
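For readers unfamiliar with FOLL_FORCE, here is a minimal user-space
sketch of the legitimate case: writing through /proc/self/mem to a
read-only private anonymous page, which breaks COW and gives the process
its own exclusive page. The function name is hypothetical, a 64-bit
system with procfs mounted is assumed, and negative return values only
mean the environment did not permit the experiment.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one byte to a PROT_READ private anonymous page via
 * /proc/self/mem and return the byte read back from the mapping.
 * Returns -1 if mmap() fails, -2 if /proc/self/mem cannot be opened,
 * and -3 if the write is refused.
 */
static int force_write_readonly_page(char value)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *map;
	int fd, ret;

	assert(page_size > 0);

	map = mmap(NULL, (size_t)page_size, PROT_READ,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	fd = open("/proc/self/mem", O_RDWR);
	if (fd < 0) {
		munmap(map, (size_t)page_size);
		return -2;
	}

	/* The file offset of /proc/<pid>/mem is the virtual address. */
	if (pwrite(fd, &value, 1, (off_t)(uintptr_t)map) != 1)
		ret = -3;
	else
		ret = (unsigned char)map[0];

	close(fd);
	munmap(map, (size_t)page_size);
	return ret;
}
```

A direct store to the mapping would fault, yet the write through
/proc/self/mem is expected to succeed because the mapping is private
and could be made writable. Shared mappings and uffd-wp-protected
ranges are the cases where such a write must not be granted silently.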
This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
let's just get rid of it.
Thanks to Nadav Amit for pointing out that the pte_dirty() check in
FOLL_FORCE code is problematic and might be exploitable.
Note 1: We don't check for the PTE being dirty because it doesn't matter
for making a "was COWed" decision anymore, and whoever modifies the
page has to set the page dirty either way.
Note 2: Kernels before extended uffd-wp support and before
PageAnonExclusive (< 5.19) can simply revert the problematic
commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
v5.19 requires minor adjustments due to lack of
vma_soft_dirty_enabled().
Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
Cc: <stable(a)vger.kernel.org> # 5.16+
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andrew Morton <akpm(a)linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Axel Rasmussen <axelrasmussen(a)google.com>
Cc: Nadav Amit <nadav.amit(a)gmail.com>
Cc: Peter Xu <peterx(a)redhat.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
---
v1 -> v2:
- Make the code easier to digest and even less error prone by performing
more explicit checks, just failing gracefully and adding better comments.
- Avoid introducing new VM_BUG_ON().
- Mention Nadav's participation in the description
- Mention that we can bypass memfd write sealing
---
include/linux/mm.h | 1 -
mm/gup.c | 68 +++++++++++++++++++++++++++++++---------------
mm/huge_memory.c | 64 +++++++++++++++++++++++++++++--------------
3 files changed, 89 insertions(+), 44 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 18e01474cf6b..2222ed598112 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2885,7 +2885,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
-#define FOLL_COW 0x4000 /* internal GUP flag */
#define FOLL_ANON 0x8000 /* don't do file mappings */
#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
diff --git a/mm/gup.c b/mm/gup.c
index 732825157430..5abdaf487460 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -478,14 +478,42 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 	return -EEXIST;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	/* If the pte is writable, we can write to the page. */
+	if (pte_write(pte))
+		return true;
+
+	/* Maybe FOLL_FORCE is set to override it? */
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/* But FOLL_FORCE has no effect on shared mappings */
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return false;
+
+	/* ... or read-only private ones */
+	if (!(vma->vm_flags & VM_MAYWRITE))
+		return false;
+
+	/* ... or already writable ones that just need to take a write fault */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page ...
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+
+	/* ... and a write-fault isn't required for other reasons. */
+	if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
+		return false;
+	return !userfaultfd_pte_wp(vma, pte);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -528,12 +556,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
if ((flags & FOLL_NUMA) && pte_protnone(pte))
goto no_page;
- if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
- pte_unmap_unlock(ptep, ptl);
- return NULL;
- }
page = vm_normal_page(vma, address, pte);
+
+ /*
+ * We only care about anon pages in can_follow_write_pte() and don't
+ * have to worry about pte_devmap() because they are never anon.
+ */
+ if ((flags & FOLL_WRITE) &&
+ !can_follow_write_pte(pte, page, vma, flags)) {
+ page = NULL;
+ goto out;
+ }
+
if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
/*
* Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -986,17 +1021,6 @@ static int faultin_page(struct vm_area_struct *vma,
return -EBUSY;
}
- /*
- * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
- * necessary, even if maybe_mkwrite decided not to set pte_write. We
- * can thus safely do subsequent page lookups as if they were reads.
- * But only do so when looping for pte_write is futile: in some cases
- * userspace may also be wanting to write to the gotten user page,
- * which a read fault here might prevent (a readonly page might get
- * reCOWed by userspace write).
- */
- if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
- *flags |= FOLL_COW;
return 0;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..e9414ee57c5b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1040,12 +1040,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
assert_spin_locked(pmd_lockptr(mm, pmd));
- /*
- * When we COW a devmap PMD entry, we split it into PTEs, so we should
- * not be in this function with `flags & FOLL_COW` set.
- */
- WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
-
/* FOLL_GET and FOLL_PIN are mutually exclusive. */
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
@@ -1395,14 +1389,42 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
return VM_FAULT_FALLBACK;
}
-/*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+ struct vm_area_struct *vma,
+ unsigned int flags)
{
- return pmd_write(pmd) ||
- ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+ /* If the pmd is writable, we can write to the page. */
+ if (pmd_write(pmd))
+ return true;
+
+ /* Maybe FOLL_FORCE is set to override it? */
+ if (!(flags & FOLL_FORCE))
+ return false;
+
+ /* But FOLL_FORCE has no effect on shared mappings */
+ if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+ return false;
+
+ /* ... or read-only private ones */
+ if (!(vma->vm_flags & VM_MAYWRITE))
+ return false;
+
+ /* ... or already writable ones that just need to take a write fault */
+ if (vma->vm_flags & VM_WRITE)
+ return false;
+
+ /*
+ * See can_change_pte_writable(): we broke COW and could map the page
+ * writable if we have an exclusive anonymous page ...
+ */
+ if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+ return false;
+
+ /* ... and a write-fault isn't required for other reasons. */
+ if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+ return false;
+ return !userfaultfd_huge_pmd_wp(vma, pmd);
}
struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1411,12 +1433,16 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned int flags)
{
struct mm_struct *mm = vma->vm_mm;
- struct page *page = NULL;
+ struct page *page;
assert_spin_locked(pmd_lockptr(mm, pmd));
- if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
- goto out;
+ page = pmd_page(*pmd);
+ VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+ if ((flags & FOLL_WRITE) &&
+ !can_follow_write_pmd(*pmd, page, vma, flags))
+ return NULL;
/* Avoid dumping huge zero page */
if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
@@ -1424,10 +1450,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
/* Full NUMA hinting faults to serialise migration in fault paths */
if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
- goto out;
-
- page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+ return NULL;
if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
return ERR_PTR(-EMLINK);
@@ -1444,7 +1467,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-out:
return page;
}
base-commit: 1612c382ffbdf1f673caec76502b1c00e6d35363
--
2.35.3
This reverts commit e7be8d1dd983156bbdd22c0319b71119a8fbb697 as it
causes zram failures. It does not revert cleanly because PTR_ERR handling
was introduced in the meantime; this is handled by appropriate IS_ERR checks.
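As background for the IS_ERR adjustment, the kernel encodes small
negative errno values in the top 4095 values of the pointer range. The
sketch below re-creates that convention in user space under hypothetical
lower-case names; it is an illustration, not the kernel's err.h. It shows
why a handle initialised to -ENOMEM counts as "not yet allocated",
whereas a handle of 0 would not be recognised as an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095 /* the same bound the kernel uses */

/* User-space re-creation of the IS_ERR() idea, for illustration only. */
static inline bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* User-space re-creation of the PTR_ERR() idea. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Small self-check showing how an errno travels inside a handle. */
static void err_ptr_self_check(void)
{
	unsigned long handle = (unsigned long)-ENOMEM;

	assert(is_err((void *)handle));              /* encoded error */
	assert(ptr_err((void *)handle) == -ENOMEM);  /* decodes back */
	assert(!is_err(NULL));                       /* 0 is not an error value */
}
```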
When under memory pressure, zs_malloc() can fail. Before the above
commit, the allocation was retried with direct reclaim enabled
(GFP_NOIO). After the commit, it is not -- only __GFP_KSWAPD_RECLAIM is
tried.
So when the failure occurs under memory pressure, the overlying
filesystem such as ext2 (mounted by the ext4 module in this case) can emit
failures, making the (file)system unusable:
EXT4-fs warning (device zram0): ext4_end_bio:343: I/O error 10 writing to inode 16386 starting block 159744)
Buffer I/O error on device zram0, logical block 159744
With direct reclaim, memory is really reclaimed and the allocation
eventually succeeds. In the worst case, the OOM killer is invoked, which
is the proper outcome if the user sets up a zram device that is too large
(in comparison to available RAM).
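The two-step allocation that this revert restores can be modelled in
user space as follows. This is a simplified sketch: the struct, the
function names, and the "pressure" flag are all hypothetical, and
malloc() merely stands in for zs_malloc(). It shows a non-blocking
attempt first, then a blocking attempt that counts a write stall and
forces the compression to be redone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model state; none of these names exist in zram. */
struct pool_model {
	bool under_pressure;        /* non-blocking allocations fail while set */
	unsigned long writestall;   /* how often the slow path was taken */
	unsigned long compressions; /* how often the data was compressed */
};

/* Fast path: must not sleep, so it fails under memory pressure. */
static void *alloc_nowait(struct pool_model *p, size_t len)
{
	return p->under_pressure ? NULL : malloc(len);
}

/* Slow path: may block; reclaim is modelled as relieving the pressure. */
static void *alloc_may_block(struct pool_model *p, size_t len)
{
	p->under_pressure = false;
	return malloc(len);
}

/* Returns the allocated handle, or NULL if even the slow path failed. */
static void *store_page(struct pool_model *p, size_t len)
{
	void *handle = NULL;

	for (;;) {
		/* Compress while holding the (modelled) per-CPU stream. */
		p->compressions++;

		/* A handle from the slow path is reused after recompressing. */
		if (handle)
			return handle;

		handle = alloc_nowait(p, len);
		if (handle)
			return handle;

		/* Drop the stream, then allow the allocation to block. */
		p->writestall++;
		handle = alloc_may_block(p, len);
		if (!handle)
			return NULL;
	}
}
```

Without memory pressure the model compresses once and never stalls;
under pressure it records one stall and compresses twice, which matches
the trade-off described above: slower writes instead of I/O errors.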
This very diff doesn't apply to 5.19 (stable) cleanly (see PTR_ERR note
above). Use revert of e7be8d1dd983 directly.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1202203
Fixes: e7be8d1dd983 ("zram: remove double compression logic")
Cc: stable(a)vger.kernel.org # 5.19
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Nitin Gupta <ngupta(a)vflare.org>
Cc: Sergey Senozhatsky <senozhatsky(a)chromium.org>
Cc: Alexey Romanov <avromanov(a)sberdevices.ru>
Cc: Dmitry Rokosov <ddrokosov(a)sberdevices.ru>
Cc: Lukas Czerner <lczerner(a)redhat.com>
Cc: Ext4 Developers List <linux-ext4(a)vger.kernel.org>
Signed-off-by: Jiri Slaby <jslaby(a)suse.cz>
---
drivers/block/zram/zram_drv.c | 42 ++++++++++++++++++++++++++---------
drivers/block/zram/zram_drv.h | 1 +
2 files changed, 33 insertions(+), 10 deletions(-)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 92cb929a45b7..226ea76cc819 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1146,14 +1146,15 @@ static ssize_t bd_stat_show(struct device *dev,
static ssize_t debug_stat_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
- int version = 2;
+ int version = 1;
struct zram *zram = dev_to_zram(dev);
ssize_t ret;
down_read(&zram->init_lock);
ret = scnprintf(buf, PAGE_SIZE,
- "version: %d\n%8llu\n",
+ "version: %d\n%8llu %8llu\n",
version,
+ (u64)atomic64_read(&zram->stats.writestall),
(u64)atomic64_read(&zram->stats.miss_free));
up_read(&zram->init_lock);
@@ -1351,7 +1352,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
{
int ret = 0;
unsigned long alloced_pages;
- unsigned long handle = 0;
+ unsigned long handle = -ENOMEM;
unsigned int comp_len = 0;
void *src, *dst, *mem;
struct zcomp_strm *zstrm;
@@ -1369,6 +1370,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
}
kunmap_atomic(mem);
+compress_again:
zstrm = zcomp_stream_get(zram->comp);
src = kmap_atomic(page);
ret = zcomp_compress(zstrm, src, &comp_len);
@@ -1377,20 +1379,39 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
if (unlikely(ret)) {
zcomp_stream_put(zram->comp);
pr_err("Compression failed! err=%d\n", ret);
+ zs_free(zram->mem_pool, handle);
return ret;
}
if (comp_len >= huge_class_size)
comp_len = PAGE_SIZE;
-
- handle = zs_malloc(zram->mem_pool, comp_len,
- __GFP_KSWAPD_RECLAIM |
- __GFP_NOWARN |
- __GFP_HIGHMEM |
- __GFP_MOVABLE);
-
+ /*
+ * handle allocation has 2 paths:
+ * a) fast path is executed with preemption disabled (for
+ * per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear,
+ * since we can't sleep;
+ * b) slow path enables preemption and attempts to allocate
+ * the page with __GFP_DIRECT_RECLAIM bit set. we have to
+ * put per-cpu compression stream and, thus, to re-do
+ * the compression once handle is allocated.
+ *
+ * if we have a 'non-null' handle here then we are coming
+ * from the slow path and handle has already been allocated.
+ */
+ if (IS_ERR((void *)handle))
+ handle = zs_malloc(zram->mem_pool, comp_len,
+ __GFP_KSWAPD_RECLAIM |
+ __GFP_NOWARN |
+ __GFP_HIGHMEM |
+ __GFP_MOVABLE);
if (IS_ERR((void *)handle)) {
zcomp_stream_put(zram->comp);
+ atomic64_inc(&zram->stats.writestall);
+ handle = zs_malloc(zram->mem_pool, comp_len,
+ GFP_NOIO | __GFP_HIGHMEM |
+ __GFP_MOVABLE);
+ if (!IS_ERR((void *)handle))
+ goto compress_again;
return PTR_ERR((void *)handle);
}
@@ -1948,6 +1969,7 @@ static int zram_add(void)
if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
+ blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
ret = device_add_disk(NULL, zram->disk, zram_disk_groups);
if (ret)
goto out_cleanup_disk;
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 158c91e54850..80c3b43b4828 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -81,6 +81,7 @@ struct zram_stats {
atomic64_t huge_pages_since; /* no. of huge pages since zram set up */
atomic64_t pages_stored; /* no. of pages currently stored */
atomic_long_t max_used_pages; /* no. of maximum pages stored */
+ atomic64_t writestall; /* no. of write slow paths */
atomic64_t miss_free; /* no. of missed free */
#ifdef CONFIG_ZRAM_WRITEBACK
atomic64_t bd_count; /* no. of pages in backing device */
--
2.37.1