August 2023 - Linux-stable-mirror

FAILED: patch "[PATCH] net: phy: broadcom: stub c45 read/write for 54810" failed to apply to 4.19-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.

To reproduce the conflict and resubmit, you may use the following commands:

git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-4.19.y
git checkout FETCH_HEAD
git cherry-pick -x 096516d092d54604d590827d05b1022c8f326639
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082133-mashing-flick-3a50@gregkh' --subject-prefix 'PATCH 4.19.y' HEAD^..

Possible dependencies:

096516d092d5 ("net: phy: broadcom: stub c45 read/write for 54810")
5a32fcdb1e68 ("net: phy: broadcom: Add statistics for all Gigabit PHYs")
1e2e61af1996 ("net: phy: broadcom: remove BCM5482 1000Base-BX support")
15772e4ddf3f ("net: phy: broadcom: remove use of ack_interrupt()")
4567d5c3eb9b ("net: phy: broadcom: implement generic .handle_interrupt() callback")
b0ed0bbfb304 ("net: phy: broadcom: add support for BCM54811 PHY")
9d42205036d4 ("net: phy: bcm54140: Make a bunch of functions static")
6937602ed3f9 ("net: phy: add Broadcom BCM54140 support")
123aff2a789c ("net: phy: broadcom: Add support for BCM53125 internal PHYs")
fe26821fa614 ("net: phy: broadcom: Wire suspend/resume for BCM54810")
0ececcfc9267 ("net: phy: broadcom: Allow BCM54810 to use bcm54xx_adjust_rxrefclk()")
75f4d8d10e01 ("net: phy: add Broadcom BCM84881 PHY driver")
b9bcb95315fe ("net: phy: broadcom: add 1000Base-X support for BCM54616S")
283da99af1d8 ("net: phy: broadcom: Add genphy_suspend and genphy_resume for BCM5464")
dcdecdcfe1fc ("net: phy: switch drivers to use dynamic feature detection")
5c3407abb338 ("net: phy: meson-gxl: add g12a support")
356d71e00d27 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From 096516d092d54604d590827d05b1022c8f326639 Mon Sep 17 00:00:00 2001
From: Justin Chen <justin.chen(a)broadcom.com>
Date: Sat, 12 Aug 2023 21:41:47 -0700
Subject: [PATCH] net: phy: broadcom: stub c45 read/write for 54810

The 54810 does not support c45. The mmd_phy_indirect accesses return
arbitrary values leading to odd behavior like saying it supports EEE
when it doesn't. We also see that reading/writing these non-existent
MMD registers leads to phy instability in some cases.

Fixes: b14995ac2527 ("net: phy: broadcom: Add BCM54810 PHY entry")
Signed-off-by: Justin Chen <justin.chen(a)broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli(a)broadcom.com>
Link: https://lore.kernel.org/r/1691901708-28650-1-git-send-email-justin.chen@bro…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>

diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 59cae0d808aa..04b2e6eeb195 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -542,6 +542,17 @@ static int bcm54xx_resume(struct phy_device *phydev)
 	return bcm54xx_config_init(phydev);
 }
 
+static int bcm54810_read_mmd(struct phy_device *phydev, int devnum, u16 regnum)
+{
+	return -EOPNOTSUPP;
+}
+
+static int bcm54810_write_mmd(struct phy_device *phydev, int devnum, u16 regnum,
+			      u16 val)
+{
+	return -EOPNOTSUPP;
+}
+
 static int bcm54811_config_init(struct phy_device *phydev)
 {
 	int err, reg;
@@ -1103,6 +1114,8 @@ static struct phy_driver broadcom_drivers[] = {
 	.get_strings = bcm_phy_get_strings,
 	.get_stats = bcm54xx_get_stats,
 	.probe = bcm54xx_phy_probe,
+	.read_mmd = bcm54810_read_mmd,
+	.write_mmd = bcm54810_write_mmd,
 	.config_init = bcm54xx_config_init,
 	.config_aneg = bcm5481_config_aneg,
 	.config_intr = bcm_phy_config_intr,
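
The effect of the two stub callbacks can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the toy_* types and functions are hypothetical stand-ins, not the kernel's struct phy_driver or its MMD helpers, and the "junk" value returned by the generic path is made up. It shows the general idea that a per-driver read_mmd callback, when present, takes precedence over the generic indirect access, so a stub returning -EOPNOTSUPP keeps callers away from registers the chip does not have.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for a PHY device and its driver. */
struct toy_phy;

struct toy_phy_driver {
	/* Optional override; NULL means "use the generic indirect access". */
	int (*read_mmd)(struct toy_phy *phy, int devnum, uint16_t regnum);
};

struct toy_phy {
	const struct toy_phy_driver *drv;
};

/* Models a chip without MMD registers: the indirect read yields junk. */
static int toy_indirect_read(struct toy_phy *phy, int devnum, uint16_t regnum)
{
	(void)phy; (void)devnum; (void)regnum;
	return 0x0006;	/* made-up bits that merely look like valid data */
}

/* The stub: refuse the access instead of touching the hardware. */
static int toy_stub_read_mmd(struct toy_phy *phy, int devnum, uint16_t regnum)
{
	(void)phy; (void)devnum; (void)regnum;
	return -EOPNOTSUPP;
}

/* Dispatch: a driver callback, when present, wins over the generic path. */
static int toy_read_mmd(struct toy_phy *phy, int devnum, uint16_t regnum)
{
	if (phy->drv && phy->drv->read_mmd)
		return phy->drv->read_mmd(phy, devnum, regnum);
	return toy_indirect_read(phy, devnum, regnum);
}
```

With a driver whose read_mmd is NULL the model returns the junk value; with toy_stub_read_mmd installed the caller sees a clean error and can conclude the feature is unsupported.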


FAILED: patch "[PATCH] net: phy: broadcom: stub c45 read/write for 54810" failed to apply to 5.4-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.

To reproduce the conflict and resubmit, you may use the following commands:

git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-5.4.y
git checkout FETCH_HEAD
git cherry-pick -x 096516d092d54604d590827d05b1022c8f326639
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2023082132-jaundice-applaud-eb72@gregkh' --subject-prefix 'PATCH 5.4.y' HEAD^..

Possible dependencies:

096516d092d5 ("net: phy: broadcom: stub c45 read/write for 54810")
5a32fcdb1e68 ("net: phy: broadcom: Add statistics for all Gigabit PHYs")
1e2e61af1996 ("net: phy: broadcom: remove BCM5482 1000Base-BX support")
15772e4ddf3f ("net: phy: broadcom: remove use of ack_interrupt()")
4567d5c3eb9b ("net: phy: broadcom: implement generic .handle_interrupt() callback")
b0ed0bbfb304 ("net: phy: broadcom: add support for BCM54811 PHY")
9d42205036d4 ("net: phy: bcm54140: Make a bunch of functions static")
6937602ed3f9 ("net: phy: add Broadcom BCM54140 support")
123aff2a789c ("net: phy: broadcom: Add support for BCM53125 internal PHYs")
fe26821fa614 ("net: phy: broadcom: Wire suspend/resume for BCM54810")
0ececcfc9267 ("net: phy: broadcom: Allow BCM54810 to use bcm54xx_adjust_rxrefclk()")
75f4d8d10e01 ("net: phy: add Broadcom BCM84881 PHY driver")
b9bcb95315fe ("net: phy: broadcom: add 1000Base-X support for BCM54616S")

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From 096516d092d54604d590827d05b1022c8f326639 Mon Sep 17 00:00:00 2001
From: Justin Chen <justin.chen(a)broadcom.com>
Date: Sat, 12 Aug 2023 21:41:47 -0700
Subject: [PATCH] net: phy: broadcom: stub c45 read/write for 54810

The 54810 does not support c45. The mmd_phy_indirect accesses return
arbitrary values leading to odd behavior like saying it supports EEE
when it doesn't. We also see that reading/writing these non-existent
MMD registers leads to phy instability in some cases.

Fixes: b14995ac2527 ("net: phy: broadcom: Add BCM54810 PHY entry")
Signed-off-by: Justin Chen <justin.chen(a)broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli(a)broadcom.com>
Link: https://lore.kernel.org/r/1691901708-28650-1-git-send-email-justin.chen@bro…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>

diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
index 59cae0d808aa..04b2e6eeb195 100644
--- a/drivers/net/phy/broadcom.c
+++ b/drivers/net/phy/broadcom.c
@@ -542,6 +542,17 @@ static int bcm54xx_resume(struct phy_device *phydev)
 	return bcm54xx_config_init(phydev);
 }
 
+static int bcm54810_read_mmd(struct phy_device *phydev, int devnum, u16 regnum)
+{
+	return -EOPNOTSUPP;
+}
+
+static int bcm54810_write_mmd(struct phy_device *phydev, int devnum, u16 regnum,
+			      u16 val)
+{
+	return -EOPNOTSUPP;
+}
+
 static int bcm54811_config_init(struct phy_device *phydev)
 {
 	int err, reg;
@@ -1103,6 +1114,8 @@ static struct phy_driver broadcom_drivers[] = {
 	.get_strings = bcm_phy_get_strings,
 	.get_stats = bcm54xx_get_stats,
 	.probe = bcm54xx_phy_probe,
+	.read_mmd = bcm54810_read_mmd,
+	.write_mmd = bcm54810_write_mmd,
 	.config_init = bcm54xx_config_init,
 	.config_aneg = bcm5481_config_aneg,
 	.config_intr = bcm_phy_config_intr,


[merged mm-nonmm-stable] nilfs2-fix-warning-in-mark_buffer_dirty-due-to-discarded-buffer-reuse.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
has been removed from the -mm tree.  Its filename was
     nilfs2-fix-warning-in-mark_buffer_dirty-due-to-discarded-buffer-reuse.patch

This patch was dropped because it was merged into the mm-nonmm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Subject: nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
Date: Fri, 18 Aug 2023 22:18:04 +0900

A syzbot stress test using a corrupted disk image reported that
mark_buffer_dirty() called from __nilfs_mark_inode_dirty() or
nilfs_palloc_commit_alloc_entry() may output a kernel warning, and can
panic if the kernel is booted with panic_on_warn.

This is because nilfs2 keeps buffer pointers in local structures for some
metadata and reuses them, but such buffers may be forcibly discarded by
nilfs_clear_dirty_page() in some critical situations.

This issue is reported to appear after commit 28a65b49eb53 ("nilfs2: do
not write dirty data after degenerating to read-only"), but the issue has
potentially existed before.

Fix this issue by checking the uptodate flag when attempting to reuse an
internally held buffer, and reloading the metadata instead of reusing the
buffer if the flag was lost.

Link: https://lkml.kernel.org/r/20230818131804.7758-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Reported-by: syzbot+cdfcae656bac88ba0e2d(a)syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/0000000000003da75f05fdeffd12@google.com
Fixes: 8c26c4e2694a ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption")
Tested-by: Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
Cc: <stable(a)vger.kernel.org> # 3.10+
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 fs/nilfs2/alloc.c | 3 ++-
 fs/nilfs2/inode.c | 7 +++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

--- a/fs/nilfs2/alloc.c~nilfs2-fix-warning-in-mark_buffer_dirty-due-to-discarded-buffer-reuse
+++ a/fs/nilfs2/alloc.c
@@ -205,7 +205,8 @@ static int nilfs_palloc_get_block(struct
 	int ret;
 
 	spin_lock(lock);
-	if (prev->bh && blkoff == prev->blkoff) {
+	if (prev->bh && blkoff == prev->blkoff &&
+	    likely(buffer_uptodate(prev->bh))) {
 		get_bh(prev->bh);
 		*bhp = prev->bh;
 		spin_unlock(lock);
--- a/fs/nilfs2/inode.c~nilfs2-fix-warning-in-mark_buffer_dirty-due-to-discarded-buffer-reuse
+++ a/fs/nilfs2/inode.c
@@ -1025,7 +1025,7 @@ int nilfs_load_inode_block(struct inode
 	int err;
 
 	spin_lock(&nilfs->ns_inode_lock);
-	if (ii->i_bh == NULL) {
+	if (ii->i_bh == NULL || unlikely(!buffer_uptodate(ii->i_bh))) {
 		spin_unlock(&nilfs->ns_inode_lock);
 		err = nilfs_ifile_get_inode_block(ii->i_root->ifile,
 						  inode->i_ino, pbh);
@@ -1034,7 +1034,10 @@ int nilfs_load_inode_block(struct inode
 		spin_lock(&nilfs->ns_inode_lock);
 		if (ii->i_bh == NULL)
 			ii->i_bh = *pbh;
-		else {
+		else if (unlikely(!buffer_uptodate(ii->i_bh))) {
+			__brelse(ii->i_bh);
+			ii->i_bh = *pbh;
+		} else {
 			brelse(*pbh);
 			*pbh = ii->i_bh;
 		}
_

Patches currently in -mm which might be from konishi.ryusuke(a)gmail.com are
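
The fix described above (reuse a cached buffer only while its uptodate flag is still set, otherwise reload) can be sketched outside the kernel. The following is a simplified model and not nilfs2 code: toy_buf, toy_cache, toy_reload() and toy_get() are hypothetical names, reference counting and locking are left out, and the fixed-size pool only exists to keep the example self-contained.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer head. */
struct toy_buf {
	bool uptodate;	/* cleared when the buffer is forcibly discarded */
	int data;
};

/* Hypothetical stand-in for a structure that keeps a buffer pointer. */
struct toy_cache {
	struct toy_buf *bh;	/* internally held buffer, may have gone stale */
};

static struct toy_buf toy_pool[4];
static int toy_pool_used;

/* Reload path: hands out a fresh, valid buffer, or NULL if none is left. */
static struct toy_buf *toy_reload(int data)
{
	struct toy_buf *b;

	if (toy_pool_used >= 4)
		return NULL;
	b = &toy_pool[toy_pool_used++];
	b->uptodate = true;
	b->data = data;
	return b;
}

/* Reuse the cached buffer only while it still holds valid contents. */
static struct toy_buf *toy_get(struct toy_cache *c, int data)
{
	struct toy_buf *fresh;

	if (c->bh && c->bh->uptodate)
		return c->bh;

	/* Missing or discarded: reload and replace the cached pointer. */
	fresh = toy_reload(data);
	if (fresh)
		c->bh = fresh;
	return fresh;
}
```

In this model, clearing uptodate on the cached buffer makes the next toy_get() return a different, freshly loaded buffer instead of the stale one.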


Re: [PATCH 6.4 000/234] 6.4.12-rc1 review

by Ronald Warsow

Hi Greg

6.4.12-rc1

compiles, boots and runs here on x86_64
(Intel Rocket Lake, i5-11400)

Thanks

Tested-by: Ronald Warsow <rwarsow(a)gmx.de>


[merged mm-stable] memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
has been removed from the -mm tree.  Its filename was
     memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Aleksa Sarai <cyphar(a)cyphar.com>
Subject: memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
Date: Mon, 14 Aug 2023 18:41:00 +1000

This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were
to set this sysctl to a more restrictive option in the host pidns you
would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and thus
it should not be possible to disable. Aside from the fact that we have
plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):

  % cat mount-memfd.c
  #define _GNU_SOURCE
  #include <assert.h>
  #include <fcntl.h>
  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <linux/mount.h>

  #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

  int main(void)
  {
  	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
  	assert(fsfd >= 0);
  	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));

  	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
  	assert(dfd >= 0);

  	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
  	assert(execfd >= 0);
  	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
  	assert(!close(execfd));

  	char *execpath = NULL;
  	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
  	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
  	assert(execfd >= 0);
  	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
  	assert(!execve(execpath, argv, envp));
  }
  % ./mount-memfd
  this file was executed from this totally private tmpfs: /proc/self/fd/5
  %

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host filesystem
(not to mention the many other things a CAP_SYS_ADMIN process would be
able to do that would be equivalent or worse), it seems strange to cause a
fair amount of headache to admins when there doesn't appear to be an
actual security benefit to blocking this. There appear to be concerns
about confused-deputy-esque attacks[2] but a confused deputy that can
write to arbitrary sysctls is a bigger security issue than executable
memfds.

/* New API */

The primary requirement from the original author appears to be more based
on the need to be able to restrict an entire system in a hierarchical
manner[3], such that child namespaces cannot re-enable executable memfds.

So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you. The new lower limit you can set
vm.memfd_noexec is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.

However, it should be noted that the setting is now completely
hierarchical. Previously, a cloned pidns would just copy the current pidns
setting, meaning that if the parent's vm.memfd_noexec was changed it
wouldn't propagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:

 * The sysctl is very new, having been merged in 6.3.
 * Several aspects of the sysctl were broken up until this patchset and
   the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace issues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw…
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg…
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw…

Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e1…
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Dominique Martinet <asmadeus(a)codewreck.org>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Daniel Verkamp <dverkamp(a)chromium.org>
Cc: Jeff Xu <jeffxu(a)google.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 include/linux/pid_namespace.h | 23 ++++++++++++++++++++++-
 kernel/pid.c                  |  3 +++
 kernel/pid_namespace.c        |  6 +++---
 kernel/pid_sysctl.h           | 30 +++++++++++++-----------------
 mm/memfd.c                    |  3 ++-
 5 files changed, 43 insertions(+), 22 deletions(-)

--- a/include/linux/pid_namespace.h~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/include/linux/pid_namespace.h
@@ -39,7 +39,6 @@ struct pid_namespace {
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-	/* sysctl for vm.memfd_noexec */
 	int memfd_noexec_scope;
 #endif
 } __randomize_layout;
@@ -56,6 +55,23 @@ static inline struct pid_namespace *get_
 	return ns;
 }
 
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	int scope = MEMFD_NOEXEC_SCOPE_EXEC;
+
+	for (; ns; ns = ns->parent)
+		scope = max(scope, READ_ONCE(ns->memfd_noexec_scope));
+
+	return scope;
+}
+#else
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	return 0;
+}
+#endif
+
 extern struct pid_namespace *copy_pid_ns(unsigned long flags,
 	struct user_namespace *user_ns, struct pid_namespace *ns);
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
@@ -70,6 +86,11 @@ static inline struct pid_namespace *get_
 	return ns;
 }
 
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	return 0;
+}
+
 static inline struct pid_namespace *copy_pid_ns(unsigned long flags,
 	struct user_namespace *user_ns, struct pid_namespace *ns)
 {
--- a/kernel/pid.c~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/kernel/pid.c
@@ -83,6 +83,9 @@ struct pid_namespace init_pid_ns = {
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	.memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
+#endif
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
--- a/kernel/pid_namespace.c~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/kernel/pid_namespace.c
@@ -110,9 +110,9 @@ static struct pid_namespace *create_pid_
 	ns->user_ns = get_user_ns(user_ns);
 	ns->ucounts = ucounts;
 	ns->pid_allocated = PIDNS_ADDING;
-
-	initialize_memfd_noexec_scope(ns);
-
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
+#endif
 	return ns;
 
 out_free_idr:
--- a/kernel/pid_sysctl.h~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/kernel/pid_sysctl.h
@@ -5,33 +5,30 @@
 #include <linux/pid_namespace.h>
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns)
-{
-	ns->memfd_noexec_scope =
-		task_active_pid_ns(current)->memfd_noexec_scope;
-}
-
 static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table,
 	int write, void *buf, size_t *lenp, loff_t *ppos)
 {
 	struct pid_namespace *ns = task_active_pid_ns(current);
 	struct ctl_table table_copy;
+	int err, scope, parent_scope;
 
 	if (write && !ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	table_copy = *table;
-	if (ns != &init_pid_ns)
-		table_copy.data = &ns->memfd_noexec_scope;
-
-	/*
-	 * set minimum to current value, the effect is only bigger
-	 * value is accepted.
-	 */
-	if (*(int *)table_copy.data > *(int *)table_copy.extra1)
-		table_copy.extra1 = table_copy.data;
 
-	return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+	/* You cannot set a lower enforcement value than your parent. */
+	parent_scope = pidns_memfd_noexec_scope(ns->parent);
+	/* Equivalent to pidns_memfd_noexec_scope(ns). */
+	scope = max(READ_ONCE(ns->memfd_noexec_scope), parent_scope);
+
+	table_copy.data = &scope;
+	table_copy.extra1 = &parent_scope;
+
+	err = proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+	if (!err && write)
+		WRITE_ONCE(ns->memfd_noexec_scope, scope);
+	return err;
 }
 
 static struct ctl_table pid_ns_ctl_table_vm[] = {
@@ -51,7 +48,6 @@ static inline void register_pid_ns_sysct
 	register_sysctl("vm", pid_ns_ctl_table_vm);
 }
 #else
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) {}
 static inline void register_pid_ns_sysctl_table_vm(void) {}
 #endif
 
--- a/mm/memfd.c~memfd-replace-ratcheting-feature-from-vmmemfd_noexec-with-hierarchy
+++ a/mm/memfd.c
@@ -271,7 +271,8 @@ long memfd_fcntl(struct file *file, unsi
 static int check_sysctl_memfd_noexec(unsigned int *flags)
 {
 #ifdef CONFIG_SYSCTL
-	int sysctl = task_active_pid_ns(current)->memfd_noexec_scope;
+	struct pid_namespace *ns = task_active_pid_ns(current);
+	int sysctl = pidns_memfd_noexec_scope(ns);
 
 	if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
 		if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
_

Patches currently in -mm which might be from cyphar(a)cyphar.com are
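
The hierarchical rule from the commit message (the effective value is the most restrictive one found while walking up to the initial namespace, and a namespace cannot go below what its parent chain enforces) can be modelled in a few lines of plain C. This is a sketch under assumptions, not the kernel implementation: toy_pidns, toy_effective_scope() and toy_set_scope() are hypothetical names, there is no locking or READ_ONCE()/WRITE_ONCE(), and -1 merely stands in for a rejected sysctl write.

```c
#include <stddef.h>

/* Illustrative copies of the three sysctl modes named in the patch. */
enum {
	TOY_SCOPE_EXEC = 0,
	TOY_SCOPE_NOEXEC_SEAL = 1,
	TOY_SCOPE_NOEXEC_ENFORCED = 2,
};

/* Hypothetical stand-in for struct pid_namespace. */
struct toy_pidns {
	struct toy_pidns *parent;	/* NULL for the initial namespace */
	int scope;			/* this namespace's own setting */
};

/* Effective value: the most restrictive setting on the path to the root. */
static int toy_effective_scope(const struct toy_pidns *ns)
{
	int scope = TOY_SCOPE_EXEC;

	for (; ns; ns = ns->parent)
		if (ns->scope > scope)
			scope = ns->scope;
	return scope;
}

/* A write is accepted only if it is not below what the parents enforce. */
static int toy_set_scope(struct toy_pidns *ns, int value)
{
	int floor = toy_effective_scope(ns->parent);

	if (value < floor || value > TOY_SCOPE_NOEXEC_ENFORCED)
		return -1;	/* stand-in for the write being rejected */
	ns->scope = value;
	return 0;
}
```

In this model a child may raise and later lower its own setting as long as the parent stays at 0, but once the parent is raised to 1 the child's effective value becomes 1 and an attempt to write 0 in the child is refused.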


[merged mm-stable] memfd-improve-userspace-warnings-for-missing-exec-related-flags.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: memfd: improve userspace warnings for missing exec-related flags
has been removed from the -mm tree.  Its filename was
     memfd-improve-userspace-warnings-for-missing-exec-related-flags.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Aleksa Sarai <cyphar(a)cyphar.com>
Subject: memfd: improve userspace warnings for missing exec-related flags
Date: Mon, 14 Aug 2023 18:40:59 +1000

In order to incentivise userspace to switch to passing MFD_EXEC and
MFD_NOEXEC_SEAL, we need to provide a warning on each attempt to call
memfd_create() without the new flags. pr_warn_once() is not useful
because on most systems the one warning is burned up during the boot
process (on my system, systemd does this within the first second of boot)
and thus userspace will in practice never see the warnings to push them
to switch to the new flags.

The original patchset[1] used pr_warn_ratelimited(), however there were
concerns about the degree of spam in the kernel log[2,3]. The resulting
inability to detect every case was flagged as an issue at the time[4].

While we could come up with an alternative rate-limiting scheme such as
only outputting the message if vm.memfd_noexec has been modified, or only
outputting the message once for a given task, these alternatives have
downsides that don't make sense given how low-stakes a single kernel
warning message is. Switching to pr_info_ratelimited() instead should be
fine -- it's possible some monitoring tool will be unhappy with a stream
of warning-level messages but there's already plenty of info-level
message spam in dmesg.

[1]: https://lore.kernel.org/20221215001205.51969-4-jeffxu@google.com/
[2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/
[3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundatio…

Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-3-7ff9e3e1…
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Daniel Verkamp <dverkamp(a)chromium.org>
Cc: Dominique Martinet <asmadeus(a)codewreck.org>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/memfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memfd.c~memfd-improve-userspace-warnings-for-missing-exec-related-flags
+++ a/mm/memfd.c
@@ -315,7 +315,7 @@ SYSCALL_DEFINE2(memfd_create,
 		return -EINVAL;
 
 	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
-		pr_warn_once(
+		pr_info_ratelimited(
 			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
 			current->comm, task_pid_nr(current));
 	}
_

Patches currently in -mm which might be from cyphar(a)cyphar.com are
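
The difference between a "once" message and a rate-limited one, which is the core of the argument above, can be shown with a toy model. This is an illustrative sketch only: toy_once and toy_ratelimit are hypothetical helpers, time is passed in as a plain number, and the window logic is a simplification that makes no claim to match the kernel's printk rate limiting in detail.

```c
#include <stdbool.h>

/* "once": only the very first caller is ever allowed to print. */
struct toy_once {
	bool fired;
};

static bool toy_once_allow(struct toy_once *o)
{
	if (o->fired)
		return false;
	o->fired = true;
	return true;
}

/* "ratelimited": at most `burst` messages per `interval` time units. */
struct toy_ratelimit {
	long interval;
	int burst;
	long window_start;
	int printed;
};

static bool toy_ratelimit_allow(struct toy_ratelimit *rl, long now)
{
	if (now - rl->window_start >= rl->interval) {
		/* A new window begins and the message budget is refilled. */
		rl->window_start = now;
		rl->printed = 0;
	}
	if (rl->printed >= rl->burst)
		return false;
	rl->printed++;
	return true;
}
```

In the "once" model, whoever calls first (for example an early boot process) consumes the only message, while the rate-limited model keeps letting later callers through each time a new window starts.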


[merged mm-stable] memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
has been removed from the -mm tree.  Its filename was
     memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Aleksa Sarai <cyphar(a)cyphar.com>
Subject: memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
Date: Mon, 14 Aug 2023 18:40:58 +1000

Given the difficulty of auditing all of userspace to figure out whether
every memfd_create() user has switched to passing MFD_EXEC and
MFD_NOEXEC_SEAL flags, it seems far less disruptive to make it possible
for older programs that don't make use of executable memfds to run under
vm.memfd_noexec=2. Otherwise, a small dependency change can result in
spurious errors. For programs that don't use executable memfds, passing
MFD_NOEXEC_SEAL is functionally a no-op and thus having the same

In addition, every failure under vm.memfd_noexec=2 needs to print to the
kernel log so that userspace can figure out where the error came from.
The concerns about pr_warn_ratelimited() spam that caused the switch to
pr_warn_once()[1,2] do not apply to the vm.memfd_noexec=2 case.

This is a user-visible API change, but as it allows programs to do
something that would be blocked before, and the sysctl itself was broken
and recently released, it seems unlikely this will cause any issues.

[1]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/

Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-2-7ff9e3e1…
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com>
Cc: Dominique Martinet <asmadeus(a)codewreck.org>
Cc: Christian Brauner <brauner(a)kernel.org>
Cc: Daniel Verkamp <dverkamp(a)chromium.org>
Cc: Jeff Xu <jeffxu(a)google.com>
Cc: Kees Cook <keescook(a)chromium.org>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 include/linux/pid_namespace.h              | 16 ++--------
 mm/memfd.c                                 | 30 ++++++-------------
 tools/testing/selftests/memfd/memfd_test.c | 22 ++++++++++---
 3 files changed, 32 insertions(+), 36 deletions(-)

--- a/include/linux/pid_namespace.h~memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2
+++ a/include/linux/pid_namespace.h
@@ -17,18 +17,10 @@
 struct fs_pin;
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-/*
- * sysctl for vm.memfd_noexec
- * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- *	acts like MFD_EXEC was set.
- * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- *	acts like MFD_NOEXEC_SEAL was set.
- * 2: memfd_create() without MFD_NOEXEC_SEAL will be
- *	rejected.
- */
-#define MEMFD_NOEXEC_SCOPE_EXEC 0
-#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1
-#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2
+/* modes for vm.memfd_noexec sysctl */
+#define MEMFD_NOEXEC_SCOPE_EXEC 0 /* MFD_EXEC implied if unset */
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 /* MFD_NOEXEC_SEAL implied if unset */
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 /* same as 1, except MFD_EXEC rejected */
 #endif
 
 struct pid_namespace {
--- a/mm/memfd.c~memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2
+++ a/mm/memfd.c
@@ -271,30 +271,22 @@ long memfd_fcntl(struct file *file, unsi
 static int check_sysctl_memfd_noexec(unsigned int *flags)
 {
 #ifdef CONFIG_SYSCTL
-	char comm[TASK_COMM_LEN];
-	int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
-	struct pid_namespace *ns;
-
-	ns = task_active_pid_ns(current);
-	if (ns)
-		sysctl = ns->memfd_noexec_scope;
+	int sysctl = task_active_pid_ns(current)->memfd_noexec_scope;
 
 	if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
-		if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
+		if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
 			*flags |= MFD_NOEXEC_SEAL;
 		else
 			*flags |= MFD_EXEC;
 	}
 
-	if (*flags & MFD_EXEC && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
-		pr_warn_once(
-			"memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n",
-			task_pid_nr(current), get_task_comm(comm, current));
-
+	if (!(*flags & MFD_NOEXEC_SEAL) && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
+		pr_err_ratelimited(
+			"%s[%d]: memfd_create() requires MFD_NOEXEC_SEAL with vm.memfd_noexec=%d\n",
+			current->comm, task_pid_nr(current), sysctl);
 		return -EACCES;
 	}
 #endif
-
 	return 0;
 }
 
@@ -302,7 +294,6 @@ SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
 		unsigned int, flags)
 {
-	char comm[TASK_COMM_LEN];
 	unsigned int *file_seals;
 	struct file *file;
 	int fd, error;
@@ -325,12 +316,13 @@ SYSCALL_DEFINE2(memfd_create,
 
 	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
 		pr_warn_once(
-			"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n",
-			task_pid_nr(current), get_task_comm(comm, current));
+			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
+			current->comm, task_pid_nr(current));
 	}
 
-	if (check_sysctl_memfd_noexec(&flags) < 0)
-		return -EACCES;
+	error = check_sysctl_memfd_noexec(&flags);
+	if (error < 0)
+		return error;
 
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
--- a/tools/testing/selftests/memfd/memfd_test.c~memfd-do-not-eacces-old-memfd_create-users-with-vmmemfd_noexec=2
+++ a/tools/testing/selftests/memfd/memfd_test.c
@@ -1145,11 +1145,23 @@ static void test_sysctl_child(void)
 
 	printf("%s sysctl 2\n", memfd_str);
 	sysctl_assert_write("2");
-	mfd_fail_new("kern_memfd_sysctl_2",
-		MFD_CLOEXEC | MFD_ALLOW_SEALING);
-	mfd_fail_new("kern_memfd_sysctl_2_MFD_EXEC",
-		MFD_CLOEXEC | MFD_EXEC);
-	fd = mfd_assert_new("", 0, MFD_NOEXEC_SEAL);
+	mfd_fail_new("kern_memfd_sysctl_2_exec",
+		     MFD_EXEC | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+	fd = mfd_assert_new("kern_memfd_sysctl_2_dfl",
+			    mfd_def_size,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_mode(fd, 0666);
+	mfd_assert_has_seals(fd, F_SEAL_EXEC);
+	mfd_fail_chmod(fd, 0777);
+	close(fd);
+
+	fd = mfd_assert_new("kern_memfd_sysctl_2_noexec_seal",
+			    mfd_def_size,
+			    MFD_NOEXEC_SEAL | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_mode(fd, 0666);
+	mfd_assert_has_seals(fd, F_SEAL_EXEC);
+	mfd_fail_chmod(fd, 0777);
 	close(fd);
 
 	sysctl_fail_write("0");
_

Patches currently in -mm which might be from cyphar(a)cyphar.com are
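
The flag handling that the patched check_sysctl_memfd_noexec() performs can be restated as a pure function, which makes the three sysctl modes easy to compare. This is a user-space sketch, not the kernel function: toy_check_noexec() is a hypothetical name, the sysctl value is passed in as a parameter instead of being read from the pid namespace, the TOY_* flag values are illustrative constants, and the log message is omitted.

```c
#include <errno.h>

/* Illustrative flag bits standing in for MFD_NOEXEC_SEAL and MFD_EXEC. */
#define TOY_MFD_NOEXEC_SEAL	0x0008U
#define TOY_MFD_EXEC		0x0010U

/*
 * sysctl: 0 = exec implied, 1 = noexec-seal implied,
 *         2 = noexec-seal implied and explicit exec requests rejected.
 */
static int toy_check_noexec(unsigned int *flags, int sysctl)
{
	/* Callers that pass neither flag get a default chosen by the mode. */
	if (!(*flags & (TOY_MFD_EXEC | TOY_MFD_NOEXEC_SEAL))) {
		if (sysctl >= 1)
			*flags |= TOY_MFD_NOEXEC_SEAL;
		else
			*flags |= TOY_MFD_EXEC;
	}

	/* In mode 2 anything that did not end up sealed is refused. */
	if (!(*flags & TOY_MFD_NOEXEC_SEAL) && sysctl >= 2)
		return -EACCES;

	return 0;
}
```

Under this model an old program that passes neither flag succeeds in mode 2 and gets a non-executable, sealed memfd, whereas a program that explicitly asks for an executable memfd is rejected with -EACCES.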


[merged mm-stable] mm-unstable-multi-gen-lru-avoid-race-in-inc_min_seq.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: Multi-gen LRU: avoid race in inc_min_seq()
has been removed from the -mm tree. Its filename was
     mm-unstable-multi-gen-lru-avoid-race-in-inc_min_seq.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Kalesh Singh <kaleshsingh(a)google.com>
Subject: Multi-gen LRU: avoid race in inc_min_seq()
Date: Tue, 1 Aug 2023 19:56:03 -0700

inc_max_seq() will try to inc_min_seq() if nr_gens == MAX_NR_GENS. This is
because the generations are reused (the last oldest now empty generation
will become the next youngest generation).

inc_min_seq() is retried until successful, dropping the lru_lock
and yielding the CPU on each failure, and retaking the lock before
trying again:

        while (!inc_min_seq(lruvec, type, can_swap)) {
                spin_unlock_irq(&lruvec->lru_lock);
                cond_resched();
                spin_lock_irq(&lruvec->lru_lock);
        }

However, the initial condition that required incrementing the min_seq
(nr_gens == MAX_NR_GENS) is not retested. This can change by another
call to inc_max_seq() from run_aging() with force_scan=true from the
debugfs interface.

Since the eviction stalls when the nr_gens == MIN_NR_GENS, avoid
unnecessarily incrementing the min_seq by rechecking the number of
generations before each attempt.

This issue was uncovered in previous discussion on the list by Yu Zhao
and Aneesh Kumar [1].

[1] https://lore.kernel.org/linux-mm/CAOUHufbO7CaVm=xjEb1avDhHVvnC8pJmGyKcFf2iY…

Link: https://lkml.kernel.org/r/20230802025606.346758-2-kaleshsingh@google.com
Fixes: d6c3af7d8a2b ("mm: multi-gen LRU: debugfs interface")
Signed-off-by: Kalesh Singh <kaleshsingh(a)google.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno(a)collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: Aneesh Kumar K V <aneesh.kumar(a)linux.ibm.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Brian Geffon <bgeffon(a)google.com>
Cc: Jan Alexander Steffens (heftig) <heftig(a)archlinux.org>
Cc: Lecopzer Chen <lecopzer.chen(a)mediatek.com>
Cc: Matthias Brugger <matthias.bgg(a)gmail.com>
Cc: Oleksandr Natalenko <oleksandr(a)natalenko.name>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Steven Barrett <steven(a)liquorix.net>
Cc: Suleiman Souhlal <suleiman(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/vmscan.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--- a/mm/vmscan.c~mm-unstable-multi-gen-lru-avoid-race-in-inc_min_seq
+++ a/mm/vmscan.c
@@ -4439,7 +4439,7 @@ static void inc_max_seq(struct lruvec *l
 	int prev, next;
 	int type, zone;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
-
+restart:
 	spin_lock_irq(&lruvec->lru_lock);
 
 	VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
@@ -4450,11 +4450,12 @@ static void inc_max_seq(struct lruvec *l
 
 		VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
 
-		while (!inc_min_seq(lruvec, type, can_swap)) {
-			spin_unlock_irq(&lruvec->lru_lock);
-			cond_resched();
-			spin_lock_irq(&lruvec->lru_lock);
-		}
+		if (inc_min_seq(lruvec, type, can_swap))
+			continue;
+
+		spin_unlock_irq(&lruvec->lru_lock);
+		cond_resched();
+		goto restart;
 	}
 
 	/*
_

Patches currently in -mm which might be from kaleshsingh(a)google.com are
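The difference between the old retry loop and the restart-based retry described above can be illustrated with a small user-space sketch. This is only a single-threaded toy model under stated assumptions: `struct toy_lruvec`, `toy_inc_min_seq()`, `toy_unlocked_window()` and both `toy_inc_max_seq_*()` functions are hypothetical stand-ins, not kernel code, and the "other caller" is simulated by a helper that changes the generation count while the lock would be dropped.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_GENS 4

/* Hypothetical stand-in for the lruvec state; not the kernel structure. */
struct toy_lruvec {
	int nr_gens;       /* generations currently in use */
	int fail_budget;   /* how many times toy_inc_min_seq() fails first */
	int min_seq_incs;  /* how many times min_seq was advanced by us */
};

/* Stand-in for inc_min_seq(): fails until fail_budget is used up. */
static bool toy_inc_min_seq(struct toy_lruvec *v)
{
	if (v->fail_budget > 0) {
		v->fail_budget--;
		return false;
	}
	v->nr_gens--;
	v->min_seq_incs++;
	return true;
}

/*
 * Stand-in for the window in which the lock is dropped. If other_advances
 * is set, some other path retires the oldest generation in the meantime
 * (an assumption made for this illustration).
 */
static void toy_unlocked_window(struct toy_lruvec *v, bool other_advances)
{
	if (other_advances && v->nr_gens == MAX_NR_GENS)
		v->nr_gens--;
}

/* Old shape: the nr_gens == MAX_NR_GENS test is made only once. */
static void toy_inc_max_seq_old(struct toy_lruvec *v, bool other_advances)
{
	if (v->nr_gens == MAX_NR_GENS) {
		while (!toy_inc_min_seq(v))
			toy_unlocked_window(v, other_advances);
	}
	v->nr_gens++;  /* create the new youngest generation */
	assert(v->nr_gens <= MAX_NR_GENS);
}

/* New shape: every attempt starts over and re-tests the precondition. */
static void toy_inc_max_seq_fixed(struct toy_lruvec *v, bool other_advances)
{
restart:
	if (v->nr_gens == MAX_NR_GENS && !toy_inc_min_seq(v)) {
		toy_unlocked_window(v, other_advances);
		goto restart;
	}
	v->nr_gens++;  /* create the new youngest generation */
	assert(v->nr_gens <= MAX_NR_GENS);
}
```

In this model, when the first attempt fails and the generation count drops while the lock is released, the old shape still advances min_seq once more, whereas the restart shape notices that the precondition no longer holds and skips the increment.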


[merged mm-stable] mm-unstable-multi-gen-lru-fix-per-zone-reclaim.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: Multi-gen LRU: fix per-zone reclaim
has been removed from the -mm tree. Its filename was
     mm-unstable-multi-gen-lru-fix-per-zone-reclaim.patch

This patch was dropped because it was merged into the mm-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Kalesh Singh <kaleshsingh(a)google.com>
Subject: Multi-gen LRU: fix per-zone reclaim
Date: Tue, 1 Aug 2023 19:56:02 -0700

MGLRU has a LRU list for each zone for each type (anon/file) in each
generation:

        long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];

The min_seq (oldest generation) can progress independently for each
type but the max_seq (youngest generation) is shared for both anon and
file. This is to maintain a common frame of reference.

In order for eviction to advance the min_seq of a type, all the per-zone
lists in the oldest generation of that type must be empty.

The eviction logic only considers pages from eligible zones for
eviction or promotion.

        scan_folios() {
                ...
                for (zone = sc->reclaim_idx; zone >= 0; zone--) {
                        ...
                        sort_folio();       // Promote
                        ...
                        isolate_folio();    // Evict
                }
                ...
        }

Consider the system has the movable zone configured and default 4
generations. The current state of the system is as shown below
(only illustrating one type for simplicity):

Type: ANON

        Zone    DMA32    Normal    Movable    Device

        Gen 0       0         0        4GB         0

        Gen 1       0       1GB        1MB         0

        Gen 2     1MB       4GB        1MB         0

        Gen 3     1MB       1MB        1MB         0

Now consider there is a GFP_KERNEL allocation request (eligible zone
index <= Normal), evict_folios() will return without doing any work
since there are no pages to scan in the eligible zones of the oldest
generation. Reclaim won't make progress until triggered from a ZONE_MOVABLE
allocation request; which may not happen soon if there is a lot of free
memory in the movable zone. This can lead to OOM kills, although there
is 1GB pages in the Normal zone of Gen 1 that we have not yet tried to
reclaim.

This issue is not seen in the conventional active/inactive LRU since
there are no per-zone lists.

If there are no (not enough) folios to scan in the eligible zones, move
folios from ineligible zone (zone_index > reclaim_index) to the next
generation. This allows for the progression of min_seq and reclaiming
from the next generation (Gen 1).

Qualcomm, Mediatek and raspberrypi [1] discovered this issue independently.

[1] https://github.com/raspberrypi/linux/issues/5395

Link: https://lkml.kernel.org/r/20230802025606.346758-1-kaleshsingh@google.com
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Kalesh Singh <kaleshsingh(a)google.com>
Reported-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Reported-by: Lecopzer Chen <lecopzer.chen(a)mediatek.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno(a)collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante(a)quicinc.com>
Cc: Yu Zhao <yuzhao(a)google.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Brian Geffon <bgeffon(a)google.com>
Cc: Jan Alexander Steffens (heftig) <heftig(a)archlinux.org>
Cc: Matthias Brugger <matthias.bgg(a)gmail.com>
Cc: Oleksandr Natalenko <oleksandr(a)natalenko.name>
Cc: Qi Zheng <zhengqi.arch(a)bytedance.com>
Cc: Steven Barrett <steven(a)liquorix.net>
Cc: Suleiman Souhlal <suleiman(a)google.com>
Cc: Suren Baghdasaryan <surenb(a)google.com>
Cc: Aneesh Kumar K V <aneesh.kumar(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/vmscan.c |   18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

--- a/mm/vmscan.c~mm-unstable-multi-gen-lru-fix-per-zone-reclaim
+++ a/mm/vmscan.c
@@ -4889,7 +4889,8 @@ static int lru_gen_memcg_seg(struct lruv
  *                          the eviction
  ******************************************************************************/
 
-static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
+static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc,
+		       int tier_idx)
 {
 	bool success;
 	int gen = folio_lru_gen(folio);
@@ -4939,6 +4940,13 @@ static bool sort_folio(struct lruvec *lr
 		return true;
 	}
 
+	/* ineligible */
+	if (zone > sc->reclaim_idx) {
+		gen = folio_inc_gen(lruvec, folio, false);
+		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
+		return true;
+	}
+
 	/* waiting for writeback */
 	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
 	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
@@ -4987,7 +4995,8 @@ static bool isolate_folio(struct lruvec
 static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 		       int type, int tier, struct list_head *list)
 {
-	int gen, zone;
+	int i;
+	int gen;
 	enum vm_event_item item;
 	int sorted = 0;
 	int scanned = 0;
@@ -5003,9 +5012,10 @@ static int scan_folios(struct lruvec *lr
 
 	gen = lru_gen_from_seq(lrugen->min_seq[type]);
 
-	for (zone = sc->reclaim_idx; zone >= 0; zone--) {
+	for (i = MAX_NR_ZONES; i > 0; i--) {
 		LIST_HEAD(moved);
 		int skipped = 0;
+		int zone = (sc->reclaim_idx + i) % MAX_NR_ZONES;
 		struct list_head *head = &lrugen->folios[gen][type][zone];
 
 		while (!list_empty(head)) {
@@ -5019,7 +5029,7 @@ static int scan_folios(struct lruvec *lr
 
 			scanned += delta;
 
-			if (sort_folio(lruvec, folio, tier))
+			if (sort_folio(lruvec, folio, sc, tier))
 				sorted += delta;
 			else if (isolate_folio(lruvec, folio, sc)) {
 				list_add(&folio->lru, list);
_

Patches currently in -mm which might be from kaleshsingh(a)google.com are
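The new loop index arithmetic in scan_folios(), `(sc->reclaim_idx + i) % MAX_NR_ZONES` for i counting down from MAX_NR_ZONES to 1, visits every zone exactly once: first the eligible zones from reclaim_idx down to 0, then the ineligible zones from the highest index down to reclaim_idx + 1. The sketch below just reproduces that arithmetic in user space. `TOY_MAX_NR_ZONES` (set to 4 here), `toy_zone_scan_order()` and `toy_zone_is_eligible()` are hypothetical names chosen for illustration; the real zone count depends on the kernel configuration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NR_ZONES 4  /* assumed zone count, for illustration only */

/*
 * Fill order[] with zone indices in the order the patched loop visits them:
 * zone = (reclaim_idx + i) % TOY_MAX_NR_ZONES for i = TOY_MAX_NR_ZONES..1.
 */
static void toy_zone_scan_order(int reclaim_idx, int order[TOY_MAX_NR_ZONES])
{
	size_t n = 0;

	assert(reclaim_idx >= 0 && reclaim_idx < TOY_MAX_NR_ZONES);

	for (int i = TOY_MAX_NR_ZONES; i > 0; i--)
		order[n++] = (reclaim_idx + i) % TOY_MAX_NR_ZONES;
}

/* Mirrors the "ineligible" test added to sort_folio(): zone > reclaim_idx. */
static int toy_zone_is_eligible(int zone, int reclaim_idx)
{
	return zone <= reclaim_idx;
}
```

Assuming the four zones of the table in the commit message are indexed 0 to 3 in the order shown, a request whose reclaim_idx is 1 would walk the zones as 1, 0, 3, 2. Folios found in zones 3 and 2 are the ones that the new "ineligible" branch in sort_folio() moves to the next generation instead of evicting.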


[merged mm-hotfixes-stable] mm-multi-gen-lru-dont-spin-during-memcg-release.patch removed from -mm tree

by Andrew Morton

The quilt patch titled
     Subject: mm: multi-gen LRU: don't spin during memcg release
has been removed from the -mm tree. Its filename was
     mm-multi-gen-lru-dont-spin-during-memcg-release.patch

This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: "T.J. Mercier" <tjmercier(a)google.com>
Subject: mm: multi-gen LRU: don't spin during memcg release
Date: Mon, 14 Aug 2023 15:16:36 +0000

When a memcg is in the process of being released mem_cgroup_tryget will
fail because its reference count has already reached 0. This can happen
during reclaim if the memcg has already been offlined, and we reclaim all
remaining pages attributed to the offlined memcg. shrink_many attempts to
skip the empty memcg in this case, and continue reclaiming from the
remaining memcgs in the old generation. If there is only one memcg
remaining, or if all remaining memcgs are in the process of being released
then shrink_many will spin until all memcgs have finished being released.
The release occurs through a workqueue, so it can take a while before
kswapd is able to make any further progress.

This fix results in reductions in kswapd activity and direct reclaim in
a test where 28 apps (working set size > total memory) are repeatedly
launched in a random sequence:

                                       A        B     delta   ratio(%)
        allocstall_movable          5962     3539     -2423     -40.64
        allocstall_normal           2661     2417      -244      -9.17
        kswapd_high_wmark_hit_quickly 53152  7594    -45558     -85.71
        pageoutrun                 57365    11750    -45615     -79.52

Link: https://lkml.kernel.org/r/20230814151636.1639123-1-tjmercier@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: T.J. Mercier <tjmercier(a)google.com>
Acked-by: Yu Zhao <yuzhao(a)google.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/vmscan.c |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

--- a/mm/vmscan.c~mm-multi-gen-lru-dont-spin-during-memcg-release
+++ a/mm/vmscan.c
@@ -4854,16 +4854,17 @@ void lru_gen_release_memcg(struct mem_cg
 
 		spin_lock_irq(&pgdat->memcg_lru.lock);
 
-		VM_WARN_ON_ONCE(hlist_nulls_unhashed(&lruvec->lrugen.list));
+		if (hlist_nulls_unhashed(&lruvec->lrugen.list))
+			goto unlock;
 
 		gen = lruvec->lrugen.gen;
 
-		hlist_nulls_del_rcu(&lruvec->lrugen.list);
+		hlist_nulls_del_init_rcu(&lruvec->lrugen.list);
 		pgdat->memcg_lru.nr_memcgs[gen]--;
 
 		if (!pgdat->memcg_lru.nr_memcgs[gen] && gen == get_memcg_gen(pgdat->memcg_lru.seq))
 			WRITE_ONCE(pgdat->memcg_lru.seq, pgdat->memcg_lru.seq + 1);
-
+unlock:
 		spin_unlock_irq(&pgdat->memcg_lru.lock);
 	}
 }
@@ -5435,8 +5436,10 @@ restart:
 	rcu_read_lock();
 
 	hlist_nulls_for_each_entry_rcu(lrugen, pos, &pgdat->memcg_lru.fifo[gen][bin], list) {
-		if (op)
+		if (op) {
 			lru_gen_rotate_memcg(lruvec, op);
+			op = 0;
+		}
 
 		mem_cgroup_put(memcg);
 
@@ -5444,7 +5447,7 @@ restart:
 		memcg = lruvec_memcg(lruvec);
 
 		if (!mem_cgroup_tryget(memcg)) {
-			op = 0;
+			lru_gen_release_memcg(memcg);
 			memcg = NULL;
 			continue;
 		}
_

Patches currently in -mm which might be from tjmercier(a)google.com are
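With this change lru_gen_release_memcg() can be reached both from the reclaim path and from the normal release path, so it has to tolerate being called for an entry that has already been unlinked. The diff handles this by testing hlist_nulls_unhashed() first and by unlinking with the _init variant, which leaves the node in a state that the test recognises. A minimal user-space sketch of that "check, unlink, reinitialise" idea follows. It uses a plain circular doubly linked list with no RCU and no locking, and every `toy_*` name is hypothetical, so it only models the idempotence and not the kernel data structure.

```c
#include <assert.h>
#include <stdbool.h>

/* Plain circular list node; a stand-in, not the kernel's hlist_nulls. */
struct toy_node {
	struct toy_node *prev, *next;
};

/* Hypothetical per-node bookkeeping: a list head plus an entry count. */
struct toy_memcg_lru {
	struct toy_node head;
	int nr_memcgs;
};

static void toy_node_init(struct toy_node *n)
{
	n->prev = n;
	n->next = n;
}

/* A node that points at itself is not on any list. */
static bool toy_unhashed(const struct toy_node *n)
{
	return n->next == n;
}

static void toy_add(struct toy_memcg_lru *lru, struct toy_node *n)
{
	n->next = lru->head.next;
	n->prev = &lru->head;
	lru->head.next->prev = n;
	lru->head.next = n;
	lru->nr_memcgs++;
}

/* Unlink and reinitialise, so a later toy_unhashed() test returns true. */
static void toy_del_init(struct toy_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	toy_node_init(n);
}

/*
 * Idempotent release: returns true if the node was unlinked by this call,
 * false if it had already been released earlier.
 */
static bool toy_release(struct toy_memcg_lru *lru, struct toy_node *n)
{
	if (toy_unhashed(n))
		return false;

	toy_del_init(n);
	lru->nr_memcgs--;
	return true;
}
```

Calling `toy_release()` twice on the same node only decrements the counter once. In this toy model, a second call after a plain unlink that did not reinitialise the node would corrupt the list or the count.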


