November 2021 - Linux-stable-mirror

[REGRESSION]: drivers/firmware: move x86 Generic System Framebuffers support

by Ilya Trukhanov

Suspend-to-RAM with elogind under Wayland stopped working in 5.15. This occurs with 5.15, 5.15.1 and latest master at 89d714ab6043bca7356b5c823f5335f5dce1f930. 5.14 and earlier releases work fine. git bisect gives d391c58271072d0b0fad93c82018d495b2633448. To reproduce: - Use elogind and Linux 5.15.1 with CONFIG_SYSFB_SIMPLEFB=n. - Start a Wayland session. I tested sway and weston, neither worked. - In a terminal emulator (I used alacritty) execute `loginctl suspend`. Normally after the last step the system would suspend, but it no longer does so after I upgraded to Linux 5.15. After running `loginctl suspend` in dmesg I get the following: [ 103.098782] elogind-daemon[2357]: Suspending system... [ 103.098794] PM: suspend entry (deep) [ 103.124621] Filesystems sync: 0.025 seconds But nothing happens afterwards. Suspend works as expected if I do any of the following: - Revert d391c58271072d0b0fad93c82018d495b2633448. - Build with CONFIG_SYSFB_SIMPLEFB=y. - Suspend from tty, even if a Wayland session is running in parallel. - Suspend from under an X11 session. - Suspend with `echo mem > /sys/power/state`. If I attach strace to the elogind-daemon process after running `loginctl suspend` then the system immediately suspends. However, if I attach strace *prior* to running `loginctl suspend` then no suspend, and the process gets stuck on a write syscall to `/sys/power/state`. I "traced" a little bit with printk (sorry, I don't know of a better way) and the call chain is as follows: state_store -> pm_suspend -> enter_state -> suspend_prepare -> pm_prepare_console -> vt_move_to_console -> vt_waitactive -> __vt_event_wait __vt_event_wait just waits until wait_event_interruptible completes, but it never does (not until I attach to elogind-daemon with strace, at least). I did not follow the chain further. - Linux version 5.15.1 (lahvuun@lahvuun) (gcc (Gentoo 11.2.0 p1) 11.2.0, GNU ld (Gentoo 2.37_p1 p0) 2.37) #51 SMP PREEMPT Tue Nov 9 23:39:25 EET 2021 - Gentoo Linux 2.8 - x86_64 AuthenticAMD - dmesg: https://pastebin.com/duj33bY8 - .config: https://pastebin.com/7Hew1g0T

3 years, 7 months

3
10
0 0

[tip: irq/urgent] PCI/MSI: Move non-mask check back into low level accessors

by tip-bot2 for Thomas Gleixner

The following commit has been merged into the irq/urgent branch of tip: Commit-ID: 9c8e9c9681a0f3f1ae90a90230d059c7a1dece5a Gitweb: https://git.kernel.org/tip/9c8e9c9681a0f3f1ae90a90230d059c7a1dece5a Author: Thomas Gleixner <tglx(a)linutronix.de> AuthorDate: Thu, 04 Nov 2021 00:27:29 +01:00 Committer: Thomas Gleixner <tglx(a)linutronix.de> CommitterDate: Thu, 11 Nov 2021 09:50:30 +01:00 PCI/MSI: Move non-mask check back into low level accessors The recent rework of PCI/MSI[X] masking moved the non-mask checks from the low level accessors into the higher level mask/unmask functions. This missed the fact that these accessors can be invoked from other places as well. The missing checks break XEN-PV which sets pci_msi_ignore_mask and also violates the virtual MSIX and the msi_attrib.maskbit protections. Instead of sprinkling checks all over the place, lift them back into the low level accessor functions. To avoid checking three different conditions combine them into one property of msi_desc::msi_attrib. [ josef: Fixed the missed conversion in the core code ] Fixes: fcacdfbef5a1 ("PCI/MSI: Provide a new set of mask and unmask functions") Reported-by: Josef Johansson <josef(a)oderland.se> Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de> Tested-by: Josef Johansson <josef(a)oderland.se> Cc: Bjorn Helgaas <helgaas(a)kernel.org> Cc: stable(a)vger.kernel.org --- drivers/pci/msi.c | 26 ++++++++++++++------------ include/linux/msi.h | 2 +- kernel/irq/msi.c | 4 ++-- 3 files changed, 17 insertions(+), 15 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 12e296d..6da7910 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -148,6 +148,9 @@ static noinline void pci_msi_update_mask(struct msi_desc *desc, u32 clear, u32 s raw_spinlock_t *lock = &desc->dev->msi_lock; unsigned long flags; + if (!desc->msi_attrib.can_mask) + return; + raw_spin_lock_irqsave(lock, flags); desc->msi_mask &= ~clear; desc->msi_mask |= set; @@ -181,7 +184,8 @@ static void pci_msix_write_vector_ctrl(struct msi_desc *desc, u32 ctrl) { void __iomem *desc_addr = pci_msix_desc_addr(desc); - writel(ctrl, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL); + if (desc->msi_attrib.can_mask) + writel(ctrl, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL); } static inline void pci_msix_mask(struct msi_desc *desc) @@ -200,23 +204,17 @@ static inline void pci_msix_unmask(struct msi_desc *desc) static void __pci_msi_mask_desc(struct msi_desc *desc, u32 mask) { - if (pci_msi_ignore_mask || desc->msi_attrib.is_virtual) - return; - if (desc->msi_attrib.is_msix) pci_msix_mask(desc); - else if (desc->msi_attrib.maskbit) + else pci_msi_mask(desc, mask); } static void __pci_msi_unmask_desc(struct msi_desc *desc, u32 mask) { - if (pci_msi_ignore_mask || desc->msi_attrib.is_virtual) - return; - if (desc->msi_attrib.is_msix) pci_msix_unmask(desc); - else if (desc->msi_attrib.maskbit) + else pci_msi_unmask(desc, mask); } @@ -484,7 +482,8 @@ msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd) entry->msi_attrib.is_64 = !!(control & PCI_MSI_FLAGS_64BIT); entry->msi_attrib.is_virtual = 0; entry->msi_attrib.entry_nr = 0; - entry->msi_attrib.maskbit = !!(control & PCI_MSI_FLAGS_MASKBIT); + entry->msi_attrib.can_mask = !pci_msi_ignore_mask && + !!(control & PCI_MSI_FLAGS_MASKBIT); entry->msi_attrib.default_irq = dev->irq; /* Save IOAPIC IRQ */ entry->msi_attrib.multi_cap = (control & PCI_MSI_FLAGS_QMASK) >> 1; entry->msi_attrib.multiple = ilog2(__roundup_pow_of_two(nvec)); @@ -495,7 +494,7 @@ msi_setup_entry(struct pci_dev *dev, int nvec, struct irq_affinity *affd) entry->mask_pos = dev->msi_cap + PCI_MSI_MASK_32; /* Save the initial mask status */ - if (entry->msi_attrib.maskbit) + if (entry->msi_attrib.can_mask) pci_read_config_dword(dev, entry->mask_pos, &entry->msi_mask); out: @@ -639,10 +638,13 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base, entry->msi_attrib.is_virtual = entry->msi_attrib.entry_nr >= vec_count; + entry->msi_attrib.can_mask = !pci_msi_ignore_mask && + !entry->msi_attrib.is_virtual; + entry->msi_attrib.default_irq = dev->irq; entry->mask_base = base; - if (!entry->msi_attrib.is_virtual) { + if (entry->msi_attrib.can_mask) { addr = pci_msix_desc_addr(entry); entry->msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL); } diff --git a/include/linux/msi.h b/include/linux/msi.h index 49cf6eb..e616f94 100644 --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -148,7 +148,7 @@ struct msi_desc { u8 is_msix : 1; u8 multiple : 3; u8 multi_cap : 3; - u8 maskbit : 1; + u8 can_mask : 1; u8 is_64 : 1; u8 is_virtual : 1; u16 entry_nr; diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c index 6a5ecee..7f350ae 100644 --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@ -529,10 +529,10 @@ static bool msi_check_reservation_mode(struct irq_domain *domain, /* * Checking the first MSI descriptor is sufficient. MSIX supports - * masking and MSI does so when the maskbit is set. + * masking and MSI does so when the can_mask attribute is set. */ desc = first_msi_entry(dev); - return desc->msi_attrib.is_msix || desc->msi_attrib.maskbit; + return desc->msi_attrib.is_msix || desc->msi_attrib.can_mask; } int __msi_domain_alloc_irqs(struct irq_domain *domain, struct device *dev,

3 years, 7 months

1
0
0 0

[tip: irq/urgent] PCI/MSI: Destroy sysfs before freeing entries

by tip-bot2 for Thomas Gleixner

The following commit has been merged into the irq/urgent branch of tip: Commit-ID: 3735459037114d31e5acd9894fad9aed104231a0 Gitweb: https://git.kernel.org/tip/3735459037114d31e5acd9894fad9aed104231a0 Author: Thomas Gleixner <tglx(a)linutronix.de> AuthorDate: Tue, 09 Nov 2021 14:53:57 +01:00 Committer: Thomas Gleixner <tglx(a)linutronix.de> CommitterDate: Thu, 11 Nov 2021 09:50:31 +01:00 PCI/MSI: Destroy sysfs before freeing entries free_msi_irqs() frees the MSI entries before destroying the sysfs entries which are exposing them. Nothing prevents a concurrent free while a sysfs file is read and accesses the possibly freed entry. Move the sysfs release ahead of freeing the entries. Fixes: 1c51b50c2995 ("PCI/MSI: Export MSI mode using attributes, not kobjects") Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> Cc: Bjorn Helgaas <helgaas(a)kernel.org> Cc: stable(a)vger.kernel.org Link: https://lore.kernel.org/r/87sfw5305m.ffs@tglx --- drivers/pci/msi.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 7043301..48e3f4e 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -368,6 +368,11 @@ static void free_msi_irqs(struct pci_dev *dev) for (i = 0; i < entry->nvec_used; i++) BUG_ON(irq_has_action(entry->irq + i)); + if (dev->msi_irq_groups) { + msi_destroy_sysfs(&dev->dev, dev->msi_irq_groups); + dev->msi_irq_groups = NULL; + } + pci_msi_teardown_msi_irqs(dev); list_for_each_entry_safe(entry, tmp, msi_list, list) { @@ -379,11 +384,6 @@ static void free_msi_irqs(struct pci_dev *dev) list_del(&entry->list); free_msi_entry(entry); } - - if (dev->msi_irq_groups) { - msi_destroy_sysfs(&dev->dev, dev->msi_irq_groups); - dev->msi_irq_groups = NULL; - } } static void pci_intx_for_msi(struct pci_dev *dev, int enable)

3 years, 7 months

1
0
0 0

[PATCH] drm/amd/display: Look at firmware version to determine using dmub on dcn21

by Mario Limonciello

Newer DMUB firmware on Renoir and Green Sardine do not need to disable dmcu and this actually causes problems with DP-C alt mode for a number of machines. Backport the fix from this from mainline. It's a hand modified backport because mainline switched to IP version checking which doesn't exist in linux-stable. BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1772 BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1735 Signed-off-by: Mario Limonciello <mario.limonciello(a)amd.com> Reviewed-by: Alex Deucher <alexander.deucher(a)amd.com> --- Resend, also pick up Alex's tag from last submission This was previously sent to stable(a)kernel.org not stable(a)vger.kernel.org. drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 1ea31dcc7a8b..084491afe540 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -1141,8 +1141,15 @@ static int amdgpu_dm_init(struct amdgpu_device *adev) case CHIP_RAVEN: case CHIP_RENOIR: init_data.flags.gpu_vm_support = true; - if (ASICREV_IS_GREEN_SARDINE(adev->external_rev_id)) + switch (adev->dm.dmcub_fw_version) { + case 0: /* development */ + case 0x1: /* linux-firmware.git hash 6d9f399 */ + case 0x01000000: /* linux-firmware.git hash 9a0b0f4 */ + init_data.flags.disable_dmcu = false; + break; + default: init_data.flags.disable_dmcu = true; + } break; case CHIP_VANGOGH: case CHIP_YELLOW_CARP: -- 2.25.1

3 years, 7 months

2
1
0 0

[PATCH V2] x86/sgx: Fix free page accounting

by Reinette Chatre

The SGX driver maintains a single global free page counter, sgx_nr_free_pages, that reflects the number of free pages available across all NUMA nodes. Correspondingly, a list of free pages is associated with each NUMA node and sgx_nr_free_pages is updated every time a page is added or removed from any of the free page lists. The main usage of sgx_nr_free_pages is by the reclaimer that will run when it (sgx_nr_free_pages) goes below a watermark to ensure that there are always some free pages available to, for example, support efficient page faults. With sgx_nr_free_pages accessed and modified from a few places it is essential to ensure that these accesses are done safely but this is not the case. sgx_nr_free_pages is read without any protection and updated with inconsistent protection by any one of the spin locks associated with the individual NUMA nodes. For example: CPU_A CPU_B ----- ----- spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); ... ... sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--; spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); The consequence of sgx_nr_free_pages not being protected is that its value may not accurately reflect the actual number of free pages on the system, impacting the availability of free pages in support of many flows. The problematic scenario is when the reclaimer does not run because it believes there to be sufficient free pages while any attempt to allocate a page fails because there are no free pages available. The worst scenario observed was a user space hang because of repeated page faults caused by no free pages made available. The following flow was encountered: asm_exc_page_fault ... sgx_vma_fault() sgx_encl_load_page() sgx_encl_eldu() // Encrypted page needs to be loaded from backing // storage into newly allocated SGX memory page sgx_alloc_epc_page() // Allocate a page of SGX memory __sgx_alloc_epc_page() // Fails, no free SGX memory ... if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer wake_up(&ksgxd_waitq); return -EBUSY; // Return -EBUSY giving reclaimer time to run return -EBUSY; return -EBUSY; return VM_FAULT_NOPAGE; The reclaimer is triggered in above flow with the following code: static bool sgx_should_reclaim(unsigned long watermark) { return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list); } In the problematic scenario there were no free pages available yet the value of sgx_nr_free_pages was above the watermark. The allocation of SGX memory thus always failed because of a lack of free pages while no free pages were made available because the reclaimer is never started because of sgx_nr_free_pages' incorrect value. The consequence was that user space kept encountering VM_FAULT_NOPAGE that caused the same address to be accessed repeatedly with the same result. Change the global free page counter to an atomic type that ensures simultaneous updates are done safely. While doing so, move the updating of the variable outside of the spin lock critical section to which it does not belong. Cc: stable(a)vger.kernel.org Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()") Suggested-by: Dave Hansen <dave.hansen(a)linux.intel.com> Reviewed-by: Tony Luck <tony.luck(a)intel.com> Signed-off-by: Reinette Chatre <reinette.chatre(a)intel.com> --- Changes since V1: - V1: https://lore.kernel.org/lkml/373992d869cd356ce9e9afe43ef4934b70d604fd.16360… - Add static to definition of sgx_nr_free_pages (Tony). - Add Tony's signature. - Provide detail about error scenario in changelog (Jarkko). arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 63d3de02bbcc..8471a8b9b48e 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -28,8 +28,7 @@ static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq); static LIST_HEAD(sgx_active_page_list); static DEFINE_SPINLOCK(sgx_reclaimer_lock); -/* The free page list lock protected variables prepend the lock. */ -static unsigned long sgx_nr_free_pages; +static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections. */ static nodemask_t sgx_numa_mask; @@ -403,14 +402,15 @@ static void sgx_reclaim_pages(void) spin_lock(&node->lock); list_add_tail(&epc_page->list, &node->free_page_list); - sgx_nr_free_pages++; spin_unlock(&node->lock); + atomic_long_inc(&sgx_nr_free_pages); } } static bool sgx_should_reclaim(unsigned long watermark) { - return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list); + return atomic_long_read(&sgx_nr_free_pages) < watermark && + !list_empty(&sgx_active_page_list); } static int ksgxd(void *p) @@ -471,9 +471,9 @@ static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid) page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list); list_del_init(&page->list); - sgx_nr_free_pages--; spin_unlock(&node->lock); + atomic_long_dec(&sgx_nr_free_pages); return page; } @@ -625,9 +625,9 @@ void sgx_free_epc_page(struct sgx_epc_page *page) spin_lock(&node->lock); list_add_tail(&page->list, &node->free_page_list); - sgx_nr_free_pages++; spin_unlock(&node->lock); + atomic_long_inc(&sgx_nr_free_pages); } static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size, -- 2.25.1

3 years, 7 months

4
8
0 0

stable-rc/queue/5.14 baseline: 171 runs, 1 regressions (v5.14.17-24-g490e6570185e)

by kernelci.org bot

stable-rc/queue/5.14 baseline: 171 runs, 1 regressions (v5.14.17-24-g490e6570185e) Regressions Summary ------------------- platform | arch | lab | compiler | defconfig | regressions ----------+------+--------------+----------+---------------------+------------ beagle-xm | arm | lab-baylibre | gcc-10 | omap2plus_defconfig | 1 Details: https://kernelci.org/test/job/stable-rc/branch/queue%2F5.14/kernel/v5.14.17… Test: baseline Tree: stable-rc Branch: queue/5.14 Describe: v5.14.17-24-g490e6570185e URL: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git SHA: 490e6570185e6437cf1186a728921c1311ebdcc6 Test Regressions ---------------- platform | arch | lab | compiler | defconfig | regressions ----------+------+--------------+----------+---------------------+------------ beagle-xm | arm | lab-baylibre | gcc-10 | omap2plus_defconfig | 1 Details: https://kernelci.org/test/plan/id/618c4e46cfa5b918f83358fe Results: 0 PASS, 1 FAIL, 0 SKIP Full config: omap2plus_defconfig Compiler: gcc-10 (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110) Plain log: https://storage.kernelci.org//stable-rc/queue-5.14/v5.14.17-24-g490e6570185… HTML log: https://storage.kernelci.org//stable-rc/queue-5.14/v5.14.17-24-g490e6570185… Rootfs: http://storage.kernelci.org/images/rootfs/buildroot/kci-2020.05-6-g8983f3b7… * baseline.login: https://kernelci.org/test/case/id/618c4e46cfa5b918f83358ff failing since 17 days (last pass: v5.14.14-64-gb66eb77f69e4, first fail: v5.14.14-124-g710e5bbf51e3)

3 years, 7 months

1
0
0 0

[merged] mm-thp-fix-incorrect-unmap-behavior-for-private-pages.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm, thp: fix incorrect unmap behavior for private pages has been removed from the -mm tree. Its filename was mm-thp-fix-incorrect-unmap-behavior-for-private-pages.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Rongwei Wang <rongwei.wang(a)linux.alibaba.com> Subject: mm, thp: fix incorrect unmap behavior for private pages When truncating pagecache on file THP, the private pages of a process should not be unmapped mapping. This incorrect behavior on a dynamic shared libraries which will cause related processes to happen core dump. A simple test for a DSO (Prerequisite is the DSO mapped in file THP): int main(int argc, char *argv[]) { int fd; fd = open(argv[1], O_WRONLY); if (fd < 0) { perror("open"); } close(fd); return 0; } The test only to open a target DSO, and do nothing. But this operation will lead one or more process to happen core dump. This patch mainly to fix this bug. Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba… Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs") Signed-off-by: Rongwei Wang <rongwei.wang(a)linux.alibaba.com> Tested-by: Xu Yu <xuyu(a)linux.alibaba.com> Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org> Cc: Song Liu <song(a)kernel.org> Cc: William Kucharski <william.kucharski(a)oracle.com> Cc: Hugh Dickins <hughd(a)google.com> Cc: Yang Shi <shy828301(a)gmail.com> Cc: Mike Kravetz <mike.kravetz(a)oracle.com> Cc: Collin Fijalkovich <cfijalkovich(a)google.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- fs/open.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) --- a/fs/open.c~mm-thp-fix-incorrect-unmap-behavior-for-private-pages +++ a/fs/open.c @@ -857,8 +857,17 @@ static int do_dentry_open(struct file *f */ smp_mb(); if (filemap_nr_thps(inode->i_mapping)) { + struct address_space *mapping = inode->i_mapping; + filemap_invalidate_lock(inode->i_mapping); - truncate_pagecache(inode, 0); + /* + * unmap_mapping_range just need to be called once + * here, because the private pages is not need to be + * unmapped mapping (e.g. data segment of dynamic + * shared libraries here). + */ + unmap_mapping_range(mapping, 0, 0, 0); + truncate_inode_pages(mapping, 0); filemap_invalidate_unlock(inode->i_mapping); } } _ Patches currently in -mm which might be from rongwei.wang(a)linux.alibaba.com are

3 years, 7 months

1
0
0 0

[merged] mm-thp-lock-filemap-when-truncating-page-cache.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm, thp: lock filemap when truncating page cache has been removed from the -mm tree. Its filename was mm-thp-lock-filemap-when-truncating-page-cache.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Rongwei Wang <rongwei.wang(a)linux.alibaba.com> Subject: mm, thp: lock filemap when truncating page cache Patch series "fix two bugs for file THP". This patch (of 2): Transparent huge page has supported read-only non-shmem files. The file- backed THP is collapsed by khugepaged and truncated when written (for shared libraries). However, there is a race when multiple writers truncate the same page cache concurrently. In that case, subpage(s) of file THP can be revealed by find_get_entry in truncate_inode_pages_range, which will trigger PageTail BUG_ON in truncate_inode_page, as follows. page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0 flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head) raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000 head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000 page dumped because: VM_BUG_ON_PAGE(PageTail(page)) ------------[ cut here ]------------ kernel BUG at mm/truncate.c:213! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ... CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ... Hardware name: ECS, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--) pc : truncate_inode_page+0x64/0x70 lr : truncate_inode_page+0x64/0x70 sp : ffff80001b60b900 x29: ffff80001b60b900 x28: 00000000000007ff x27: ffff80001b60b9a0 x26: 0000000000000000 x25: 000000000000000f x24: ffff80001b60b9a0 x23: ffff80001b60ba18 x22: ffff0001e0999ea8 x21: ffff0000c21db300 x20: ffffffffffffffff x19: fffffe001310ffc0 x18: 0000000000000020 x17: 0000000000000000 x16: 0000000000000000 x15: ffff0000c21db960 x14: 3030306666666620 x13: 6666666666666666 x12: 3130303030303030 x11: ffff8000117b69b8 x10: 00000000ffff8000 x9 : ffff80001012690c x8 : 0000000000000000 x7 : ffff8000114f69b8 x6 : 0000000000017ffd x5 : ffff0007fffbcbc8 x4 : ffff80001b60b5c0 x3 : 0000000000000001 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: truncate_inode_page+0x64/0x70 truncate_inode_pages_range+0x550/0x7e4 truncate_pagecache+0x58/0x80 do_dentry_open+0x1e4/0x3c0 vfs_open+0x38/0x44 do_open+0x1f0/0x310 path_openat+0x114/0x1dc do_filp_open+0x84/0x134 do_sys_openat2+0xbc/0x164 __arm64_sys_openat+0x74/0xc0 el0_svc_common.constprop.0+0x88/0x220 do_el0_svc+0x30/0xa0 el0_svc+0x20/0x30 el0_sync_handler+0x1a4/0x1b0 el0_sync+0x180/0x1c0 Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000) ---[ end trace f70cdb42cb7c2d42 ]--- Kernel panic - not syncing: Oops - BUG: Fatal exception This patch mainly to lock filemap when one enter truncate_pagecache(), avoiding truncating the same page cache concurrently. Link: https://lkml.kernel.org/r/20211025092134.18562-1-rongwei.wang@linux.alibaba… Link: https://lkml.kernel.org/r/20211025092134.18562-2-rongwei.wang@linux.alibaba… Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs") Signed-off-by: Xu Yu <xuyu(a)linux.alibaba.com> Signed-off-by: Rongwei Wang <rongwei.wang(a)linux.alibaba.com> Suggested-by: Matthew Wilcox (Oracle) <willy(a)infradead.org> Tested-by: Song Liu <song(a)kernel.org> Cc: Collin Fijalkovich <cfijalkovich(a)google.com> Cc: Hugh Dickins <hughd(a)google.com> Cc: Mike Kravetz <mike.kravetz(a)oracle.com> Cc: William Kucharski <william.kucharski(a)oracle.com> Cc: Yang Shi <shy828301(a)gmail.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- fs/open.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) --- a/fs/open.c~mm-thp-lock-filemap-when-truncating-page-cache +++ a/fs/open.c @@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f * of THPs into the page cache will fail. */ smp_mb(); - if (filemap_nr_thps(inode->i_mapping)) + if (filemap_nr_thps(inode->i_mapping)) { + filemap_invalidate_lock(inode->i_mapping); truncate_pagecache(inode, 0); + filemap_invalidate_unlock(inode->i_mapping); + } } return 0; _ Patches currently in -mm which might be from rongwei.wang(a)linux.alibaba.com are

3 years, 7 months

1
0
0 0

[merged] memcg-prohibit-unconditional-exceeding-the-limit-of-dying-tasks.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: memcg: prohibit unconditional exceeding the limit of dying tasks has been removed from the -mm tree. Its filename was memcg-prohibit-unconditional-exceeding-the-limit-of-dying-tasks.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Vasily Averin <vvs(a)virtuozzo.com> Subject: memcg: prohibit unconditional exceeding the limit of dying tasks Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It is assumed that the amount of the memory charged by those tasks is bound and most of the memory will get released while the task is exiting. This is resembling a heuristic for the global OOM situation when tasks get access to memory reserves. There is no global memory shortage at the memcg level so the memcg heuristic is more relieved. The above assumption is overly optimistic though. E.g. vmalloc can scale to really large requests and the heuristic would allow that. We used to have an early break in the vmalloc allocator for killed tasks but this has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when the current task is killed""). There are likely other similar code paths which do not check for fatal signals in an allocation&charge loop. Also there are some kernel objects charged to a memcg which are not bound to a process life time. It has been observed that it is not really hard to trigger these bypasses and cause global OOM situation. One potential way to address these runaways would be to limit the amount of excess (similar to the global OOM with limited oom reserves). This is certainly possible but it is not really clear how much of an excess is desirable and still protects from global OOMs as that would have to consider the overall memcg configuration. This patch is addressing the problem by removing the heuristic altogether. Bypass is only allowed for requests which either cannot fail or where the failure is not desirable while excess should be still limited (e.g. atomic requests). Implementation wise a killed or dying task fails to charge if it has passed the OOM killer stage. That should give all forms of reclaim chance to restore the limit before the failure (ENOMEM) and tell the caller to back off. In addition, this patch renames should_force_charge() helper to task_is_dying() because now its use is not associated witch forced charging. This patch depends on pagefault_out_of_memory() to not trigger out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM and cause a global OOM killer. Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com Signed-off-by: Vasily Averin <vvs(a)virtuozzo.com> Suggested-by: Michal Hocko <mhocko(a)suse.com> Acked-by: Michal Hocko <mhocko(a)suse.com> Cc: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com> Cc: Roman Gushchin <guro(a)fb.com> Cc: Uladzislau Rezki <urezki(a)gmail.com> Cc: Vlastimil Babka <vbabka(a)suse.cz> Cc: Shakeel Butt <shakeelb(a)google.com> Cc: Mel Gorman <mgorman(a)techsingularity.net> Cc: Tetsuo Handa <penguin-kernel(a)i-love.sakura.ne.jp> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/memcontrol.c | 27 ++++++++------------------- 1 file changed, 8 insertions(+), 19 deletions(-) --- a/mm/memcontrol.c~memcg-prohibit-unconditional-exceeding-the-limit-of-dying-tasks +++ a/mm/memcontrol.c @@ -234,7 +234,7 @@ enum res_type { iter != NULL; \ iter = mem_cgroup_iter(NULL, iter, NULL)) -static inline bool should_force_charge(void) +static inline bool task_is_dying(void) { return tsk_is_oom_victim(current) || fatal_signal_pending(current) || (current->flags & PF_EXITING); @@ -1624,7 +1624,7 @@ static bool mem_cgroup_out_of_memory(str * A few threads which were not waiting at mutex_lock_killable() can * fail to bail out. Therefore, check again after holding oom_lock. */ - ret = should_force_charge() || out_of_memory(&oc); + ret = task_is_dying() || out_of_memory(&oc); unlock: mutex_unlock(&oom_lock); @@ -2579,6 +2579,7 @@ static int try_charge_memcg(struct mem_c struct page_counter *counter; enum oom_status oom_status; unsigned long nr_reclaimed; + bool passed_oom = false; bool may_swap = true; bool drained = false; unsigned long pflags; @@ -2614,15 +2615,6 @@ retry: goto force; /* - * Unlike in global OOM situations, memcg is not in a physical - * memory shortage. Allow dying and OOM-killed tasks to - * bypass the last charges so that they can exit quickly and - * free their memory. - */ - if (unlikely(should_force_charge())) - goto force; - - /* * Prevent unbounded recursion when reclaim operations need to * allocate memory. This might exceed the limits temporarily, * but we prefer facilitating memory reclaim and getting back @@ -2679,8 +2671,9 @@ retry: if (gfp_mask & __GFP_RETRY_MAYFAIL) goto nomem; - if (fatal_signal_pending(current)) - goto force; + /* Avoid endless loop for tasks bypassed by the oom killer */ + if (passed_oom && task_is_dying()) + goto nomem; /* * keep retrying as long as the memcg oom killer is able to make @@ -2689,14 +2682,10 @@ retry: */ oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages * PAGE_SIZE)); - switch (oom_status) { - case OOM_SUCCESS: + if (oom_status == OOM_SUCCESS) { + passed_oom = true; nr_retries = MAX_RECLAIM_RETRIES; goto retry; - case OOM_FAILED: - goto force; - default: - goto nomem; } nomem: if (!(gfp_mask & __GFP_NOFAIL)) _ Patches currently in -mm which might be from vvs(a)virtuozzo.com are

3 years, 7 months

1
0
0 0

[merged] mm-oom-do-not-trigger-out_of_memory-from-the-pf.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm, oom: do not trigger out_of_memory from the #PF has been removed from the -mm tree. Its filename was mm-oom-do-not-trigger-out_of_memory-from-the-pf.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Michal Hocko <mhocko(a)suse.com> Subject: mm, oom: do not trigger out_of_memory from the #PF Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.] Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko(a)suse.com> Signed-off-by: Vasily Averin <vvs(a)virtuozzo.com> Acked-by: Michal Hocko <mhocko(a)suse.com> Cc: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Mel Gorman <mgorman(a)techsingularity.net> Cc: Roman Gushchin <guro(a)fb.com> Cc: Shakeel Butt <shakeelb(a)google.com> Cc: Tetsuo Handa <penguin-kernel(a)i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki(a)gmail.com> Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com> Cc: Vlastimil Babka <vbabka(a)suse.cz> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/oom_kill.c | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) --- a/mm/oom_kill.c~mm-oom-do-not-trigger-out_of_memory-from-the-pf +++ a/mm/oom_kill.c @@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *o } /* - * The pagefault handler calls here because it is out of memory, so kill a - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom - * killing is already in progress so do nothing. + * The pagefault handler calls here because some allocation has failed. We have + * to take care of the memcg OOM here because this is the only safe context without + * any locks held but let the oom killer triggered from the allocation context care + * about the global OOM. */ void pagefault_out_of_memory(void) { - struct oom_control oc = { - .zonelist = NULL, - .nodemask = NULL, - .memcg = NULL, - .gfp_mask = 0, - .order = 0, - }; + static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); if (mem_cgroup_oom_synchronize(true)) return; @@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void) if (fatal_signal_pending(current)) return; - if (!mutex_trylock(&oom_lock)) - return; - out_of_memory(&oc); - mutex_unlock(&oom_lock); + if (__ratelimit(&pfoom_rs)) + pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n"); } SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) _ Patches currently in -mm which might be from mhocko(a)suse.com are

3 years, 7 months

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror November 2021