- Linux-stable-mirror - lists.linaro.org

[PATCH 0/4] [3.16.y] thp03 LTP test panic the kernel after applying the l1tf patches

by Wenkuan Wang

For stable tree 3.16.y, as regarding the page set into PAGE_NONE, the PFN will be inverted, when reference it by pmd_page, it needs to be inverted again controlling by protnone_mask(pfn). https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/… This LTP test case thp03 will get the kernel OOPS like bellow, and it could be reproduced every time. BUG: unable to handle kernel paging request at ffffeafffd330000 IP: [<ffffffff8117f109>] __split_huge_page_pmd+0xc9/0x270 PGD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp ioatdma coretemp crct10dif_pclmul crct10dif_common aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd sb_edac edac_core lpc_ich i2c_i801 dca CPU: 0 PID: 610 Comm: thp03 Not tainted 3.14.39ltsi-WR7.0.0.28_standard #1 Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS CCFRCLC0.019.1308201516 08/20/2013 task: ffff8800b5a83040 ti: ffff880138b94000 task.ti: ffff880138b94000 RIP: 0010:[<ffffffff8117f109>] [<ffffffff8117f109>] __split_huge_page_pmd+0xc9/0x270 RSP: 0018:ffff880138b97d08 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff8801b8bd7220 RCX: 0000000000000008 RDX: 000000fffd330000 RSI: 00007f30c8800000 RDI: 0000000000000001 RBP: ffff880138b97d48 R08: ffff880138b29980 R09: ffff880138bd7220 R10: 00000007f30c86c0 R11: ffffea0004e2f5f0 R12: 00007f30c8a00000 R13: ffffeafffd330000 R14: ffffea0000000000 R15: 00007f30c8800000 FS: 00007f30c8ec1700(0000) GS:ffff88013b600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffeafffd330000 CR3: 0000000236ca8000 CR4: 0000000000160770 Stack: ffff880138b29980 ffff880138bd7220 ffffea0004e2f5f0 00007f30c88c0000 ffff880138bd7220 00007f30c88c0000 ffff880138b289c0 0000000000000000 ffff880138b97d68 ffffffff8117fda6 ffff880237391f80 00007f30c88c0000 Call Trace: [<ffffffff8117fda6>] split_huge_page_pmd_mm+0x46/0x50 [<ffffffff8117fdda>] split_huge_page_address+0x2a/0x30 [<ffffffff8117fea9>] __vma_adjust_trans_huge+0xc9/0xf0 [<ffffffff81151d05>] vma_adjust+0x6a5/0x710 [<ffffffff81151f55>] __split_vma.isra.33+0x1e5/0x200 [<ffffffff81152d59>] split_vma+0x29/0x30 [<ffffffff81147e56>] SyS_madvise+0x6a6/0x720 [<ffffffff81a92105>] system_call_fastpath+0x26/0x2b After applying these patches: thp03 1 TPASS : system didn't crash, pass. Tom Lendacky (1): x86/mm: Simplify p[g4um]d_page() macros Toshi Kani (3): x86/asm: Add pud/pmd mask interfaces to handle large PAT bit x86/asm: Move PUD_PAGE macros to page_types.h x86/asm: Fix pud/pmd interfaces to handle large PAT bit arch/x86/include/asm/page_64_types.h | 3 --- arch/x86/include/asm/page_types.h | 3 +++ arch/x86/include/asm/pgtable.h | 19 ++++++++++------- arch/x86/include/asm/pgtable_types.h | 40 ++++++++++++++++++++++++++++++++---- 4 files changed, 51 insertions(+), 14 deletions(-) -- 1.8.3.1

6 years, 8 months

2
7
0 0

[PATCH] ACPI / power: Skip duplicate power-resources in PR tables

by Hans de Goede

Some ACPI tables contain duplicate resources like this: Name (_PR0, Package (0x04) // _PR0: Power Resources for D0 { P28P, P18P, P18P, CLK4 }) This causes a WARN_ON in sysfs_add_link_to_group because we end up adding a link to the same acpi_device twice: sysfs: cannot create duplicate filename '/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/808622C1:00/OVTI2680:00/power_resources_D0/LNXPOWER:0a' CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.12-301.fc29.x86_64 #1 Hardware name: Insyde CherryTrail/Type2 - Board Product Name, BIOS jumperx.T87.KFBNEEA02 04/13/2016 Call Trace: dump_stack+0x5c/0x80 sysfs_warn_dup.cold.3+0x17/0x2a sysfs_do_create_link_sd.isra.2+0xa9/0xb0 sysfs_add_link_to_group+0x30/0x50 acpi_power_expose_list+0x74/0xa0 acpi_power_add_remove_device+0x50/0xa0 acpi_add_single_object+0x26b/0x5f0 acpi_bus_check_add+0xc4/0x250 ... This commit makes acpi_extract_power_resources() check for duplicates and simply skips them when found, fixing this. Cc: stable(a)vger.kernel.org Signed-off-by: Hans de Goede <hdegoede(a)redhat.com> --- drivers/acpi/power.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/drivers/acpi/power.c b/drivers/acpi/power.c index 1b475bc1ae16..bc10a5efc1ba 100644 --- a/drivers/acpi/power.c +++ b/drivers/acpi/power.c @@ -131,6 +131,23 @@ void acpi_power_resources_list_free(struct list_head *list) } } +static bool acpi_power_resource_is_dup(union acpi_object *package, + unsigned int start, unsigned int i) +{ + acpi_handle rhandle, dup; + unsigned int j; + + /* Note the element types have already been checked in our caller */ + rhandle = package->package.elements[i].reference.handle; + for (j = start; j < i; j++) { + dup = package->package.elements[j].reference.handle; + if (dup == rhandle) + return true; + } + + return false; +} + int acpi_extract_power_resources(union acpi_object *package, unsigned int start, struct list_head *list) { @@ -150,6 +167,11 @@ int acpi_extract_power_resources(union acpi_object *package, unsigned int start, err = -ENODEV; break; } + + /* Some ACPI tables contain duplicate resources, skip these */ + if (acpi_power_resource_is_dup(package, start, i)) + continue; + err = acpi_add_power_resource(rhandle); if (err) break; -- 2.20.1

6 years, 8 months

2
1
0 0

[regression] USB power management failure to suspend / high CPU usage

by Eric Blau

Hi folks, I noticed a regression introduced sometime after 4.19.4 in USB power management. I have a 2015 MacBook Pro. When I try to do a suspend or a suspend+hibernate, I get the following error messages trying to suspend usb2 and the suspend fails. This works fine in 4.19.4: Dec 22 13:50:36 eric-macbookpro kernel: Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done. Dec 22 13:50:36 eric-macbookpro kernel: Suspending console(s) (use no_console_suspend to debug) Dec 22 13:50:36 eric-macbookpro kernel: dpm_run_callback(): usb_dev_freeze+0x0/0x10 returns -16 Dec 22 13:50:36 eric-macbookpro kernel: PM: Device usb2 failed to freeze async: error -16 Dec 22 13:50:38 eric-macbookpro systemd[1]: systemd-hybrid-sleep.service: Main process exited, code=exited, status=1/FAILURE Dec 22 13:50:38 eric-macbookpro systemd[1]: systemd-hybrid-sleep.service: Failed with result 'exit-code'. Dec 22 13:50:38 eric-macbookpro systemd[1]: Failed to start Hybrid Suspend+Hibernate. Dec 22 13:50:38 eric-macbookpro systemd[1]: Dependency failed for Hybrid Suspend+Hibernate. Dec 22 13:50:38 eric-macbookpro systemd[1]: hybrid-sleep.target: Job hybrid-sleep.target/start failed with result 'dependency'. Dec 22 13:50:38 eric-macbookpro systemd-logind[1573]: Operation 'sleep' finished. Dec 22 13:50:38 eric-macbookpro systemd[1]: Stopped target Sleep. The behavior exists in 4.19.8 and 4.19.11, the kernel versions I have upgraded to with Arch Linux, so the regression was introduced sometime between 4.19.4 and 4.19.8. Hibernate still works but when I resume from hibernate, there is a ksoftirqd and kworker thread/process together taking up 100% of one core. If I turn off auto power control for usb1 and usb2, the threads stop spinning. i.e., echo 'on' > '/sys/bus/usb/devices/usb1/power/control Any suggestions as to where this regression was introduced and what can be done to fix it? Thanks, Eric

6 years, 8 months

2
4
0 0

[PATCH v2 1/1] scsi: Synchronize request queue PM status only on successful resume

by stanley.chu＠mediatek.com

From: Stanley Chu <stanley.chu(a)mediatek.com> The commit 356fd2663cff ("scsi: Set request queue runtime PM status back to active on resume") fixed up the inconsistent RPM status between request queue and device. However changing request queue RPM status shall be done only on successful resume, otherwise status may be still inconsistent as below, Request queue: RPM_ACTIVE Device: RPM_SUSPENDED This ends up soft lockup because requests can be submitted to underlying devices but those devices and their required resource are not resumed. Fixes: 356fd2663cff ("scsi: Set request queue runtime PM status back to active on resume") Cc: stable(a)vger.kernel.org Signed-off-by: Stanley Chu <stanley.chu(a)mediatek.com> --- drivers/scsi/scsi_pm.c | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c index a2b4179bfdf7..7639df91b110 100644 --- a/drivers/scsi/scsi_pm.c +++ b/drivers/scsi/scsi_pm.c @@ -80,8 +80,22 @@ static int scsi_dev_type_resume(struct device *dev, if (err == 0) { pm_runtime_disable(dev); - pm_runtime_set_active(dev); + err = pm_runtime_set_active(dev); pm_runtime_enable(dev); + + /* + * Forcibly set runtime PM status of request queue to "active" + * to make sure we can again get requests from the queue + * (see also blk_pm_peek_request()). + * + * The resume hook will correct runtime PM status of the disk. + */ + if (!err && scsi_is_sdev_device(dev)) { + struct scsi_device *sdev = to_scsi_device(dev); + + if (sdev->request_queue->dev) + blk_set_runtime_active(sdev->request_queue); + } } return err; @@ -140,16 +154,6 @@ static int scsi_bus_resume_common(struct device *dev, else fn = NULL; - /* - * Forcibly set runtime PM status of request queue to "active" to - * make sure we can again get requests from the queue (see also - * blk_pm_peek_request()). - * - * The resume hook will correct runtime PM status of the disk. - */ - if (scsi_is_sdev_device(dev) && pm_runtime_suspended(dev)) - blk_set_runtime_active(to_scsi_device(dev)->request_queue); - if (fn) { async_schedule_domain(fn, dev, &scsi_sd_pm_domain); -- 2.18.0

6 years, 8 months

3
2
0 0

[PATCH] soc: fsl: qbman: avoid race in clearing QMan interrupt

by Madalin-cristian Bucur

By clearing all interrupt sources, not only those that already occurred, the existing code may acknowledge by mistake interrupts that occurred after the code checks for them. Signed-off-by: Madalin Bucur <madalin.bucur(a)nxp.com> Signed-off-by: Roy Pledge <roy.pledge(a)nxp.com> --- drivers/soc/fsl/qbman/qman.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c index 52c153cd795a..636f83f781f5 100644 --- a/drivers/soc/fsl/qbman/qman.c +++ b/drivers/soc/fsl/qbman/qman.c @@ -1143,18 +1143,19 @@ static void qm_mr_process_task(struct work_struct *work); static irqreturn_t portal_isr(int irq, void *ptr) { struct qman_portal *p = ptr; - - u32 clear = QM_DQAVAIL_MASK | p->irq_sources; u32 is = qm_in(&p->p, QM_REG_ISR) & p->irq_sources; + u32 clear = 0; if (unlikely(!is)) return IRQ_NONE; /* DQRR-handling if it's interrupt-driven */ - if (is & QM_PIRQ_DQRI) + if (is & QM_PIRQ_DQRI) { __poll_portal_fast(p, QMAN_POLL_LIMIT); + clear = QM_DQAVAIL_MASK | QM_PIRQ_DQRI; + } /* Handling of anything else that's interrupt-driven */ - clear |= __poll_portal_slow(p, is); + clear |= __poll_portal_slow(p, is) & QM_PIRQ_SLOW; qm_out(&p->p, QM_REG_ISR, clear); return IRQ_HANDLED; } -- 2.1.0

6 years, 8 months

1
0
0 0

[PATCH 2/2] staging: rtl8188eu: Fix module loading from tasklet for WEP encryption

by Larry Finger

Commit 2b2ea09e74a5 ("staging:r8188eu: Use lib80211 to decrypt WEP-frames") causes scheduling while atomic bugs followed by a hard freeze whenever the driver tries to connect to a WEP-encrypted network. Experimentation showed that the freezes were eliminated when module lib80211 was preloaded, which can be forced by calling lib80211_get_crypto_ops() directly rather than indirectly through try_then_request_module(). With this change, no BUG messages are logged. Fixes: 2b2ea09e74a5 ("staging:r8188eu: Use lib80211 to decrypt WEP-frames") Cc: Stable <stable(a)vger.kernel.org> # v4.17+ Cc: Michael Straube <straube.linux(a)gmail.com> Cc: Ivan Safonov <insafonov(a)gmail.com> Signed-off-by: Larry Finger <Larry.Finger(a)lwfinger.net> --- drivers/staging/rtl8188eu/core/rtw_security.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/staging/rtl8188eu/core/rtw_security.c b/drivers/staging/rtl8188eu/core/rtw_security.c index 052656a22821..bab96c870042 100644 --- a/drivers/staging/rtl8188eu/core/rtw_security.c +++ b/drivers/staging/rtl8188eu/core/rtw_security.c @@ -154,7 +154,7 @@ void rtw_wep_encrypt(struct adapter *padapter, u8 *pxmitframe) pframe = ((struct xmit_frame *)pxmitframe)->buf_addr + hw_hdr_offset; - crypto_ops = try_then_request_module(lib80211_get_crypto_ops("WEP"), "lib80211_crypt_wep"); + crypto_ops = lib80211_get_crypto_ops("WEP"); if (!crypto_ops) return; @@ -210,7 +210,7 @@ int rtw_wep_decrypt(struct adapter *padapter, u8 *precvframe) void *crypto_private = NULL; int status = _SUCCESS; const int keyindex = prxattrib->key_index; - struct lib80211_crypto_ops *crypto_ops = try_then_request_module(lib80211_get_crypto_ops("WEP"), "lib80211_crypt_wep"); + struct lib80211_crypto_ops *crypto_ops = lib80211_get_crypto_ops("WEP"); char iv[4], icv[4]; if (!crypto_ops) { -- 2.16.4

6 years, 8 months

1
0
0 0

[PATCH 1/2] staging: rtl8188eu: Fix module loading from tasklet for CCMP encryption

by Larry Finger

Commit 6bd082af7e36 ("staging:r8188eu: use lib80211 CCMP decrypt") causes scheduling while atomic bugs followed by a hard freeze whenever the driver tries to connect to a CCMP-encrypted network. Experimentation showed that the freezes were eliminated when module lib80211 was preloaded, which can be forced by calling lib80211_get_crypto_ops() directly rather than indirectly through try_then_request_module(). With this change, no BUG messages are logged. Fixes: 6bd082af7e36 ("staging:r8188eu: use lib80211 CCMP decrypt") Cc: Stable <stable(a)vger.kernel.org> # v4.17+ Reported-and-tested-by: Michael Straube <straube.linux(a)gmail.com> Cc: Ivan Safonov <insafonov(a)gmail.com> Signed-off-by: Larry Finger <Larry.Finger(a)lwfinger.net> --- drivers/staging/rtl8188eu/core/rtw_security.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/staging/rtl8188eu/core/rtw_security.c b/drivers/staging/rtl8188eu/core/rtw_security.c index f7407632e80b..052656a22821 100644 --- a/drivers/staging/rtl8188eu/core/rtw_security.c +++ b/drivers/staging/rtl8188eu/core/rtw_security.c @@ -1291,7 +1291,7 @@ u32 rtw_aes_decrypt(struct adapter *padapter, u8 *precvframe) struct sk_buff *skb = ((struct recv_frame *)precvframe)->pkt; void *crypto_private = NULL; u8 *key, *pframe = skb->data; - struct lib80211_crypto_ops *crypto_ops = try_then_request_module(lib80211_get_crypto_ops("CCMP"), "lib80211_crypt_ccmp"); + struct lib80211_crypto_ops *crypto_ops = lib80211_get_crypto_ops("CCMP"); struct security_priv *psecuritypriv = &padapter->securitypriv; char iv[8], icv[8]; -- 2.16.4

6 years, 8 months

1
0
0 0

[PATCH] x86/retpoline: change RETPOLINE into CONFIG_RETPOLINE

by Nadav Amit

A recent enhancement intentionally fails the kernel build if the compiler does not support retpolines and CONFIG_RETPOLINE is set. However, the patch that introduced it did not change RETPOLINE macro references into CONFIG_RETPOLINE ones. As a result, indirect branches that are used by init functions are not kept (i.e., they use retpolines), and modules that do not use retpolines are marked as retpoline-safe. Fix it be changing RETPOLINE into CONFIG_RETPOLINE. Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support") Cc: Peter Zijlstra <peterz(a)infradead.org> Cc: Zhenzhong Duan <zhenzhong.duan(a)oracle.com> Cc: Thomas Gleixner <tglx(a)linutronix.de> Cc: David Woodhouse <dwmw(a)amazon.co.uk> Cc: Andy Lutomirski <luto(a)kernel.org> Cc: Masahiro Yamada <yamada.masahiro(a)socionext.com> Cc: stable(a)vger.kernel.org Signed-off-by: Nadav Amit <namit(a)vmware.com> --- arch/x86/kernel/cpu/bugs.c | 2 +- include/linux/compiler-gcc.h | 2 +- include/linux/module.h | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c index 8654b8b0c848..1de0f4170178 100644 --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -215,7 +215,7 @@ static enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init = static enum spectre_v2_user_mitigation spectre_v2_user __ro_after_init = SPECTRE_V2_USER_NONE; -#ifdef RETPOLINE +#ifdef CONFIG_RETPOLINE static bool spectre_v2_bad_module; bool retpoline_module_ok(bool has_retpoline) diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h index 2010493e1040..977ddf2774f9 100644 --- a/include/linux/compiler-gcc.h +++ b/include/linux/compiler-gcc.h @@ -68,7 +68,7 @@ */ #define uninitialized_var(x) x = x -#ifdef RETPOLINE +#ifdef CONFIG_RETPOLINE #define __noretpoline __attribute__((__indirect_branch__("keep"))) #endif diff --git a/include/linux/module.h b/include/linux/module.h index fce6b4335e36..0c575f51fe57 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -817,7 +817,7 @@ static inline void module_bug_finalize(const Elf_Ehdr *hdr, static inline void module_bug_cleanup(struct module *mod) {} #endif /* CONFIG_GENERIC_BUG */ -#ifdef RETPOLINE +#ifdef CONFIG_RETPOLINE extern bool retpoline_module_ok(bool has_retpoline); #else static inline bool retpoline_module_ok(bool has_retpoline) -- 2.17.1

6 years, 8 months

2
3
0 0

[PATCH v2] netfilter: xt_connlimit: fix race in connection counting

by Alakesh Haloi

An iptable rule like the following on a multicore systems will result in accepting more connections than set in the rule. iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \ --connlimit-above 2000 --connlimit-mask 0 -j DROP In check_hlist function, connections that are found in saved connections but not in netfilter conntrack are deleted, assuming that those connections do not exist anymore. But for multi core systems, there exists a small time window, when a connection has been added to the xt_connlimit maintained rb-tree but has not yet made to netfilter conntrack table. This causes concurrent connections to return incorrect counts and go over limit set in iptable rule. The fix has been partially backported from the above mentioned upstream commit. Introduce timestamp and the owning cpu. Signed-off-by: Alakesh Haloi <alakeshh(a)amazon.com> Cc: Pablo Neira Ayuso <pablo(a)netfilter.org> Cc: Jozsef Kadlecsik <kadlec(a)blackhole.kfki.hu> Cc: Florian Westphal <fw(a)strlen.de> Cc: "David S. Miller" <davem(a)davemloft.net> Cc: stable(a)vger.kernel.org # v4.15 and before Cc: netdev(a)vger.kernel.org Cc: Dmitry Andrianov <dmitry.andrianov(a)alertme.com> Cc: Justin Pettit <jpettit(a)vmware.com> Cc: Yi-Hung Wei <yihung.wei(a)gmail.com> --- net/netfilter/xt_connlimit.c | 28 ++++++++++++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index ffa8eec..e7b092b 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -47,6 +47,8 @@ struct xt_connlimit_conn { struct hlist_node node; struct nf_conntrack_tuple tuple; union nf_inet_addr addr; + int cpu; + u32 jiffies32; }; struct xt_connlimit_rb { @@ -126,6 +128,8 @@ static bool add_hlist(struct hlist_head *head, return false; conn->tuple = *tuple; conn->addr = *addr; + conn->cpu = raw_smp_processor_id(); + conn->jiffies32 = (u32)jiffies; hlist_add_head(&conn->node, head); return true; } @@ -148,8 +152,26 @@ static unsigned int check_hlist(struct net *net, hlist_for_each_entry_safe(conn, n, head, node) { found = nf_conntrack_find_get(net, zone, &conn->tuple); if (found == NULL) { - hlist_del(&conn->node); - kmem_cache_free(connlimit_conn_cachep, conn); + /* If connection is not found, it may be because + * it has not made into conntrack table yet. We + * check if it is a recently created connection + * on a different core and do not delete it in that + * case. + */ + + unsigned long a, b; + int cpu = raw_smp_processor_id(); + __u32 age; + + b = conn->jiffies; + a = (u32)jiffies; + age = a - b; + if (conn->cpu != cpu && age <= 2) { + length++; + } else { + hlist_del(&conn->node); + kmem_cache_free(connlimit_conn_cachep, conn); + } continue; } @@ -271,6 +293,8 @@ static void tree_nodes_free(struct rb_root *root, conn->tuple = *tuple; conn->addr = *addr; + conn->cpu = raw_smp_processor_id(); + conn->jiffies32 = (u32)jiffies; rbconn->addr = *addr; INIT_HLIST_HEAD(&rbconn->hhead); -- 1.8.3.1

6 years, 8 months

1
0
0 0

[PATCH] netfilter: xt_connlimit: fix race in connection counting

by Alakesh Haloi

An iptable rule like the following on a multicore systems will result in accepting more connections than set in the rule. iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \ --connlimit-above 2000 --connlimit-mask 0 -j DROP In check_hlist function, connections that are found in saved connections but not in netfilter conntrack are deleted, assuming that those connections do not exist anymore. But for multi core systems, there exists a small time window, when a connection has been added to the xt_connlimit maintained rb-tree but has not yet made to netfilter conntrack table. This causes concurrent connections to return incorrect counts and go over limit set in iptable rule. Connection 1 on Core 1 Connection 2 on Core 2 list_length = N conntrack_table_len = N spin_lock_bh() In check_hlist() function a. loop over saved connections 1. call nf_conntrack_find_get() 2. If not found in 1, i. call hlist_del() b. return total count to caller c. connection gets added to list of saved connections. spin_unlock_bh() list_length = N + 1 spin_lock_bh() on core 2 In check_hlist() function a. loop over saved connection. 1. call nf_conntrack_find_get() 2. If not found in 1. i. call hlist_del() [Connection 1 was in the but not in nf_conntrack yet] ii. connection 1 gets deleted list_length = N conntrack_table_len = N b. return total count to caller c. connection 2 gets added to list of saved connections spin_unlock_bh() d. connection 1 gets added to nf_conntrack list_length = N + 1 conntrack_table_len = N + 1 e. connection 2 gets added to nf_conntrack list_length = N + 1 conntrack_table_len = N + 2 So we end up with N + 1 connections in the list but N + 2 in nf_conntrack, allowing more number of connections eventually than set in the rule. This fix adds an additional field to track such pending connections and prevent them from being deleted by another execution thread on a different core and returns correct count. Signed-off-by: Alakesh Haloi <alakeshh(a)amazon.com> Cc: Pablo Neira Ayuso <pablo(a)netfilter.org> Cc: Jozsef Kadlecsik <kadlec(a)blackhole.kfki.hu> Cc: Florian Westphal <fw(a)strlen.de> Cc: "David S. Miller" <davem(a)davemloft.net> Cc: stable(a)vger.kernel.org # v4.15 and before --- net/netfilter/xt_connlimit.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c index ffa8eec980e9..bd7563c209a4 100644 --- a/net/netfilter/xt_connlimit.c +++ b/net/netfilter/xt_connlimit.c @@ -47,6 +47,7 @@ struct xt_connlimit_conn { struct hlist_node node; struct nf_conntrack_tuple tuple; union nf_inet_addr addr; + bool pending_add; }; struct xt_connlimit_rb { @@ -126,6 +127,7 @@ static bool add_hlist(struct hlist_head *head, return false; conn->tuple = *tuple; conn->addr = *addr; + conn->pending_add = true; hlist_add_head(&conn->node, head); return true; } @@ -144,15 +146,31 @@ static unsigned int check_hlist(struct net *net, *addit = true; - /* check the saved connections */ + /* check the saved connections + */ hlist_for_each_entry_safe(conn, n, head, node) { found = nf_conntrack_find_get(net, zone, &conn->tuple); if (found == NULL) { - hlist_del(&conn->node); - kmem_cache_free(connlimit_conn_cachep, conn); + /* It could be an already deleted connection or + * a new connection that is not there in conntrack + * yet. If former delete it from the list, else + * increase count and move on. + */ + if (conn->pending_add) { + length++; + } else { + hlist_del(&conn->node); + kmem_cache_free(connlimit_conn_cachep, conn); + } continue; } + /* If it is a connection that was pending insertion to + * connection tracking table before, then it's time to clear + * the flag. + */ + conn->pending_add = false; + found_ct = nf_ct_tuplehash_to_ctrack(found); if (nf_ct_tuple_equal(&conn->tuple, tuple)) { -- 2.14.4

6 years, 8 months

4
10
0 0

[PATCH 0/2] x86/mm/pkeys: fix user-visible pkey state destruction at fork()

by Dave Hansen

Hi x86 maintainers, This is an important fix that I believe needs to be merged for 4.21. Without it, applications calling fork() can potentially double-allocate a protection key, causing lots of strange problems. Thomas's Reviewed-by is on the the actual fix, but not the selftest. I would also be happy to send this as a pull request if you would prefer. Cc: x86(a)kernel.org Cc: Thomas Gleixner <tglx(a)linutronix.de> Cc: Ingo Molnar <mingo(a)redhat.com> Cc: Borislav Petkov <bp(a)alien8.de> Cc: "H. Peter Anvin" <hpa(a)zytor.com> Cc: Peter Zijlstra <peterz(a)infradead.org> Cc: Michael Ellerman <mpe(a)ellerman.id.au> Cc: Will Deacon <will.deacon(a)arm.com> Cc: Andy Lutomirski <luto(a)kernel.org> Cc: Joerg Roedel <jroedel(a)suse.de> Cc: stable(a)vger.kernel.org

6 years, 8 months

1
2
0 0

+ slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch added to -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed has been added to the -mm tree. Its filename is slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch This patch should soon appear at http://ozlabs.org/~akpm/mmots/broken-out/slab-alien-caches-must-not-be-init… and later at http://ozlabs.org/~akpm/mmotm/broken-out/slab-alien-caches-must-not-be-init… Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next and is updated there every 3-4 working days ------------------------------------------------------ From: Christoph Lameter <cl(a)linux.com> Subject: slab: alien caches must not be initialized if the allocation of the alien cache failed Callers of __alloc_alien() check for NULL. We must do the same check in __alloc_alien_cache to avoid NULL pointer dereferences on allocation failures. Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e9897490… Signed-off-by: Christoph Lameter <cl(a)linux.com> Reported-by: syzbot+d6ed4ec679652b4fd4e4(a)syzkaller.appspotmail.com Reviewed-by: Andrew Morton <akpm(a)linux-foundation.org> Cc: Pekka Enberg <penberg(a)kernel.org> Cc: David Rientjes <rientjes(a)google.com> Cc: Joonsoo Kim <iamjoonsoo.kim(a)lge.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/slab.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) --- a/mm/slab.c~slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed +++ a/mm/slab.c @@ -666,8 +666,10 @@ static struct alien_cache *__alloc_alien struct alien_cache *alc = NULL; alc = kmalloc_node(memsize, gfp, node); - init_arraycache(&alc->ac, entries, batch); - spin_lock_init(&alc->lock); + if (alc) { + init_arraycache(&alc->ac, entries, batch); + spin_lock_init(&alc->lock); + } return alc; } _ Patches currently in -mm which might be from cl(a)linux.com are slab-alien-caches-must-not-be-initialized-if-the-allocation-of-the-alien-cache-failed.patch

6 years, 8 months

1
0
0 0

+ fork-memcg-fix-cached_stacks-case.patch added to -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: fork, memcg: fix cached_stacks case has been added to the -mm tree. Its filename is fork-memcg-fix-cached_stacks-case.patch This patch should soon appear at http://ozlabs.org/~akpm/mmots/broken-out/fork-memcg-fix-cached_stacks-case.… and later at http://ozlabs.org/~akpm/mmotm/broken-out/fork-memcg-fix-cached_stacks-case.… Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next and is updated there every 3-4 working days ------------------------------------------------------ From: Shakeel Butt <shakeelb(a)google.com> Subject: fork, memcg: fix cached_stacks case 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail") fixes a crash caused due to failed memcg charge of the kernel stack. However the fix misses the cached_stacks case which this patch fixes. So, the same crash can happen if the memcg charge of a cached stack is failed. Link: http://lkml.kernel.org/r/20190102180145.57406-1-shakeelb@google.com Fixes: 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on memcg charge fail") Signed-off-by: Shakeel Butt <shakeelb(a)google.com> Cc: Rik van Riel <riel(a)surriel.com> Cc: Roman Gushchin <guro(a)fb.com> Cc: Michal Hocko <mhocko(a)suse.com> Cc: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Tejun Heo <tj(a)kernel.org> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- kernel/fork.c | 1 + 1 file changed, 1 insertion(+) --- a/kernel/fork.c~fork-memcg-fix-cached_stacks-case +++ a/kernel/fork.c @@ -221,6 +221,7 @@ static unsigned long *alloc_thread_stack memset(s->addr, 0, THREAD_SIZE); tsk->stack_vm_area = s; + tsk->stack = s->addr; return s->addr; } _ Patches currently in -mm which might be from shakeelb(a)google.com are fork-memcg-fix-cached_stacks-case.patch

6 years, 8 months

1
0
0 0

[PATCH 4.14 0/4] netfilter: xt_connlimit: backport upstream fixes for race in connection counting

by Mauricio Faria de Oliveira

Recently, Alakesh Haloi reported the following issue [1] with stable/4.14: """ An iptable rule like the following on a multicore systems will result in accepting more connections than set in the rule. iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit \ --connlimit-above 2000 --connlimit-mask 0 -j DROP """ And proposed a fix that is not in Linus's tree. The discussion went on to confirm whether the issue was still reproducible with mainline/nf.git tip, and to either identify the upstream fix or re-submit the non-upstream fix. Alakesh eventually was able to test with upstream, and reported that issue was still reproducible [2]. On that, our findinds diverge, at least in my test environment: First, I verified that the suggested mainline fix for the issue [3] indeed fixes it, by testing with it applied and reverted on v4.18, a clean revert. (The issue is reproducible with the commit reverted). Then, with a consistent reproducer, I moved to nf.git, with HEAD on commit a007232 ("netfilter: nf_conncount: fix argument order to find_next_bit"), and the issues was not reproducible (even with 20+ threads on client side, the number Alakesh reported to achieve 2150+ connections [4], and I tried spreading the network interface IRQ affinity over more and more CPUs too.) Either way, the suggested mainline fix does actually fix the issue in 4.14 for at least one environment. So, it might well be the case that Alakesh's test environment has differences/subtleties that leads to more connections accepted, and more commits are needed for that particular environment type. But for now, with one bare-metal environment (24-core server, 4-core client) verified, I thought of submitting the patches for review/comments/testing, then looking for additional fixes for that environment separately. The fix is PATCH 4/4, and PATCHes 1-3/4 are helpers for a cleaner backport. All backports are simple, and essentially consist of refresh context lines and use older struct/file names. Reviews from netfilter maintainers are very appreciated, as I've no previous experience in this area, and although the backports look simple and build/run correctly, there's usually stuff that only more experienced people may notice. Thanks, Mauricio Links: ===== [1] https://www.spinics.net/lists/stable/msg270040.html [2] https://www.spinics.net/lists/stable/msg273669.html [3] https://www.spinics.net/lists/stable/msg271300.html [4] https://www.spinics.net/lists/stable/msg273669.html Test-case: ========= - v4.14.91 (original): client achieves 2000+ connections (6000 target) with 3 threads. server # iptables -F server # iptables -A INPUT -p tcp -m tcp --syn --dport 7777 -m connlimit --connlimit-above 2000 --connlimit-mask 0 -j DROP server # iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination DROP tcp -- anywhere anywhere tcp dpt:7777 flags:FIN,SYN,RST,ACK/SYN #conn src/0 > 2000 Chain FORWARD (policy ACCEPT) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination server # ulimit -SHn 65000 server # ruby server.rb <... listening ...> client # ulimit -SHn 65000 client # ruby client.rb 10.230.56.100 7777 6000 3 Connecting to ["10.230.56.100"]:7777 6000 times with 3 1 2 3 <...> 2000 <...> 6000 Target reached. Thread finishing 6001 Target reached. Thread finishing 6002 Target reached. Thread finishing Threads done. 6002 connections press enter to exit - v4.14.91 + patches: client only achieved 2000 connections. server # (same procedure) client # (same procedure) Connecting to ["10.230.56.100"]:7777 6000 times with 3 1 2 3 <...> 2000 <... blocked for a while...> failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777 failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777 failed to create connection: Connection timed out - connect(2) for "10.230.56.100" port 7777 Threads done. 2000 connections press enter to exit Florian Westphal (2): netfilter: xt_connlimit: don't store address in the conn nodes netfilter: nf_conncount: fix garbage collection confirm race Pablo Neira Ayuso (1): netfilter: nf_conncount: expose connection list interface Yi-Hung Wei (1): netfilter: nf_conncount: Fix garbage collection with zones include/net/netfilter/nf_conntrack_count.h | 15 +++++ net/netfilter/xt_connlimit.c | 99 +++++++++++++++++++++++------- 2 files changed, 91 insertions(+), 23 deletions(-) create mode 100644 include/net/netfilter/nf_conntrack_count.h -- 2.7.4

6 years, 8 months

2
8
0 0

[merged] lib-dont-depend-on-linux-headers-being-installed.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: lib/gen_crc64table.c: don't depend on linux headers being installed has been removed from the -mm tree. Its filename was lib-dont-depend-on-linux-headers-being-installed.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: NeilBrown <neil(a)brown.name> Subject: lib/gen_crc64table.c: don't depend on linux headers being installed gen_crc64table requires linux include files to be installed in /usr/include/linux. This is a new requrement so hosts that could previously build the kernel, now cannot. gen_crc64table makes this requirement by including <linux/swab.h>, but nothing from that header is actually used. So remove the #include, so that the linux headers no longer need to be installed. Link: http://lkml.kernel.org/r/87y3899gzi.fsf@notabene.neil.brown.name Fixes: feba04fd2cf8 ("lib: add crc64 calculation routines") Signed-off-by: NeilBrown <neil(a)brown.name> Acked-by: Coly Li <colyli(a)suse.de> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/lib/gen_crc64table.c~lib-dont-depend-on-linux-headers-being-installed +++ a/lib/gen_crc64table.c @@ -16,8 +16,6 @@ #include <inttypes.h> #include <stdio.h> -#include <linux/swab.h> - #define CRC64_ECMA182_POLY 0x42F0E1EBA9EA3693ULL static uint64_t crc64_table[256] = {0}; _ Patches currently in -mm which might be from neil(a)brown.name are

6 years, 8 months

1
0
0 0

[merged] memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: memcg, oom: notify on oom killer invocation from the charge path has been removed from the -mm tree. Its filename was memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Michal Hocko <mhocko(a)suse.com> Subject: memcg, oom: notify on oom killer invocation from the charge path Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path") has moved the oom handling back to the charge path. While doing so the notification was left behind in mem_cgroup_oom_synchronize. Fix the issue by replicating the oom hierarchy locking and the notification. Link: http://lkml.kernel.org/r/20181224091107.18354-1-mhocko@kernel.org Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path") Signed-off-by: Michal Hocko <mhocko(a)suse.com> Reported-by: Burt Holzman <burt(a)fnal.gov> Acked-by: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com Cc: <stable(a)vger.kernel.org> [4.19+] Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/mm/memcontrol.c~memcg-oom-notify-on-oom-killer-invocation-from-the-charge-path +++ a/mm/memcontrol.c @@ -1673,6 +1673,9 @@ enum oom_status { static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order) { + enum oom_status ret; + bool locked; + if (order > PAGE_ALLOC_COSTLY_ORDER) return OOM_SKIPPED; @@ -1707,10 +1710,23 @@ static enum oom_status mem_cgroup_oom(st return OOM_ASYNC; } + mem_cgroup_mark_under_oom(memcg); + + locked = mem_cgroup_oom_trylock(memcg); + + if (locked) + mem_cgroup_oom_notify(memcg); + + mem_cgroup_unmark_under_oom(memcg); if (mem_cgroup_out_of_memory(memcg, mask, order)) - return OOM_SUCCESS; + ret = OOM_SUCCESS; + else + ret = OOM_FAILED; + + if (locked) + mem_cgroup_oom_unlock(memcg); - return OOM_FAILED; + return ret; } /** _ Patches currently in -mm which might be from mhocko(a)suse.com are mm-memcg-fix-reclaim-deadlock-with-writeback.patch

6 years, 8 months

1
0
0 0

[merged] mm-swap-fix-swapoff-with-ksm-pages.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm, swap: fix swapoff with KSM pages has been removed from the -mm tree. Its filename was mm-swap-fix-swapoff-with-ksm-pages.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Huang Ying <ying.huang(a)intel.com> Subject: mm, swap: fix swapoff with KSM pages KSM pages may be mapped to the multiple VMAs that cannot be reached from one anon_vma. So during swapin, a new copy of the page need to be generated if a different anon_vma is needed, please refer to comments of ksm_might_need_to_copy() for details. During swapoff, unuse_vma() uses anon_vma (if available) to locate VMA and virtual address mapped to the page, so not all mappings to a swapped out KSM page could be found. So in try_to_unuse(), even if the swap count of a swap entry isn't zero, the page needs to be deleted from swap cache, so that, in the next round a new page could be allocated and swapin for the other mappings of the swapped out KSM page. But this contradicts with the THP swap support. Where the THP could be deleted from swap cache only after the swap count of every swap entry in the huge swap cluster backing the THP has reach 0. So try_to_unuse() is changed in commit e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out") to check that before delete a page from swap cache, but this has broken KSM swapoff too. Fortunately, KSM is for the normal pages only, so the original behavior for KSM pages could be restored easily via checking PageTransCompound(). That is how this patch works. The bug is introduced by e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out"), which is merged by v4.14-rc1. So I think we should backport the fix to from 4.14 on. But Hugh thinks it may be rare for the KSM pages being in the swap device when swapoff, so nobody reports the bug so far. Link: http://lkml.kernel.org/r/20181226051522.28442-1-ying.huang@intel.com Fixes: e07098294adf ("mm, THP, swap: support to reclaim swap space for THP swapped out") Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com> Reported-by: Hugh Dickins <hughd(a)google.com> Tested-by: Hugh Dickins <hughd(a)google.com> Acked-by: Hugh Dickins <hughd(a)google.com> Cc: Rik van Riel <riel(a)redhat.com> Cc: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Minchan Kim <minchan(a)kernel.org> Cc: Shaohua Li <shli(a)kernel.org> Cc: Daniel Jordan <daniel.m.jordan(a)oracle.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/mm/swapfile.c~mm-swap-fix-swapoff-with-ksm-pages +++ a/mm/swapfile.c @@ -2197,7 +2197,8 @@ int try_to_unuse(unsigned int type, bool */ if (PageSwapCache(page) && likely(page_private(page) == entry.val) && - !page_swapped(page)) + (!PageTransCompound(page) || + !swap_page_trans_huge_swapped(si, entry))) delete_from_swap_cache(compound_head(page)); /* _ Patches currently in -mm which might be from ying.huang(a)intel.com are mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch mm-swap-fix-race-between-swapoff-and-some-swap-operations-v6.patch mm-fix-race-between-swapoff-and-mincore.patch

6 years, 8 months

1
0
0 0

[merged] hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race has been removed from the -mm tree. Its filename was hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Mike Kravetz <mike.kravetz(a)oracle.com> Subject: hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race hugetlbfs page faults can race with truncate and hole punch operations. Current code in the page fault path attempts to handle this by 'backing out' operations if we encounter the race. One obvious omission in the current code is removing a page newly added to the page cache. This is pretty straight forward to address, but there is a more subtle and difficult issue of backing out hugetlb reservations. To handle this correctly, the 'reservation state' before page allocation needs to be noted so that it can be properly backed out. There are four distinct possibilities for reservation state: shared/reserved, shared/no-resv, private/reserved and private/no-resv. Backing out a reservation may require memory allocation which could fail so that needs to be taken into account as well. Instead of writing the required complicated code for this rare occurrence, just eliminate the race. i_mmap_rwsem is now held in read mode for the duration of page fault processing. Hold i_mmap_rwsem longer in truncation and hold punch code to cover the call to remove_inode_hugepages. With this modification, code in remove_inode_hugepages checking for races becomes 'dead' as it can not longer happen. Remove the dead code and expand comments to explain reasoning. Similarly, checks for races with truncation in the page fault path can be simplified and removed. [mike.kravetz(a)oracle.com: incorporat suggestions from Kirill] Link: http://lkml.kernel.org/r/20181222223013.22193-3-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20181218223557.5202-3-mike.kravetz@oracle.com Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd") Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com> Cc: Michal Hocko <mhocko(a)kernel.org> Cc: Hugh Dickins <hughd(a)google.com> Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange(a)redhat.com> Cc: Davidlohr Bueso <dave(a)stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/fs/hugetlbfs/inode.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race +++ a/fs/hugetlbfs/inode.c @@ -383,17 +383,16 @@ hugetlb_vmdelete_list(struct rb_root_cac * truncation is indicated by end of range being LLONG_MAX * In this case, we first scan the range and release found pages. * After releasing pages, hugetlb_unreserve_pages cleans up region/reserv - * maps and global counts. Page faults can not race with truncation - * in this routine. hugetlb_no_page() prevents page faults in the - * truncated range. It checks i_size before allocation, and again after - * with the page table lock for the page held. The same lock must be - * acquired to unmap a page. + * maps and global counts. * hole punch is indicated if end is not LLONG_MAX * In the hole punch case we scan the range and release found pages. * Only when releasing a page is the associated region/reserv map * deleted. The region/reserv map for ranges without associated - * pages are not modified. Page faults can race with hole punch. - * This is indicated if we find a mapped page. + * pages are not modified. + * + * Callers of this routine must hold the i_mmap_rwsem in write mode to prevent + * races with page faults. + * * Note: If the passed end of range value is beyond the end of file, but * not LLONG_MAX this routine still performs a hole punch operation. */ @@ -423,32 +422,14 @@ static void remove_inode_hugepages(struc for (i = 0; i < pagevec_count(&pvec); ++i) { struct page *page = pvec.pages[i]; - u32 hash; index = page->index; - hash = hugetlb_fault_mutex_hash(h, current->mm, - &pseudo_vma, - mapping, index, 0); - mutex_lock(&hugetlb_fault_mutex_table[hash]); - /* - * If page is mapped, it was faulted in after being - * unmapped in caller. Unmap (again) now after taking - * the fault mutex. The mutex will prevent faults - * until we finish removing the page. - * - * This race can only happen in the hole punch case. - * Getting here in a truncate operation is a bug. + * A mapped page is impossible as callers should unmap + * all references before calling. And, i_mmap_rwsem + * prevents the creation of additional mappings. */ - if (unlikely(page_mapped(page))) { - BUG_ON(truncate_op); - - i_mmap_lock_write(mapping); - hugetlb_vmdelete_list(&mapping->i_mmap, - index * pages_per_huge_page(h), - (index + 1) * pages_per_huge_page(h)); - i_mmap_unlock_write(mapping); - } + VM_BUG_ON(page_mapped(page)); lock_page(page); /* @@ -470,7 +451,6 @@ static void remove_inode_hugepages(struc } unlock_page(page); - mutex_unlock(&hugetlb_fault_mutex_table[hash]); } huge_pagevec_release(&pvec); cond_resched(); @@ -482,9 +462,20 @@ static void remove_inode_hugepages(struc static void hugetlbfs_evict_inode(struct inode *inode) { + struct address_space *mapping = inode->i_mapping; struct resv_map *resv_map; + /* + * The vfs layer guarantees that there are no other users of this + * inode. Therefore, it would be safe to call remove_inode_hugepages + * without holding i_mmap_rwsem. We acquire and hold here to be + * consistent with other callers. Since there will be no contention + * on the semaphore, overhead is negligible. + */ + i_mmap_lock_write(mapping); remove_inode_hugepages(inode, 0, LLONG_MAX); + i_mmap_unlock_write(mapping); + resv_map = (struct resv_map *)inode->i_mapping->private_data; /* root inode doesn't have the resv_map, so we should check it */ if (resv_map) @@ -505,8 +496,8 @@ static int hugetlb_vmtruncate(struct ino i_mmap_lock_write(mapping); if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)) hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0); - i_mmap_unlock_write(mapping); remove_inode_hugepages(inode, offset, LLONG_MAX); + i_mmap_unlock_write(mapping); return 0; } @@ -540,8 +531,8 @@ static long hugetlbfs_punch_hole(struct hugetlb_vmdelete_list(&mapping->i_mmap, hole_start >> PAGE_SHIFT, hole_end >> PAGE_SHIFT); - i_mmap_unlock_write(mapping); remove_inode_hugepages(inode, hole_start, hole_end); + i_mmap_unlock_write(mapping); inode_unlock(inode); } @@ -624,7 +615,11 @@ static long hugetlbfs_fallocate(struct f /* addr is the offset within the file (zero based) */ addr = index * hpage_size; - /* mutex taken here, fault path and hole punch */ + /* + * fault mutex taken here, protects against fault path + * and hole punch. inode_lock previously taken protects + * against truncation. + */ hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping, index, addr); mutex_lock(&hugetlb_fault_mutex_table[hash]); --- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-to-fix-page-fault-truncate-race +++ a/mm/hugetlb.c @@ -3755,16 +3755,16 @@ static vm_fault_t hugetlb_no_page(struct } /* - * Use page lock to guard against racing truncation - * before we get page_table_lock. + * We can not race with truncation due to holding i_mmap_rwsem. + * Check once here for faults beyond end of file. */ + size = i_size_read(mapping->host) >> huge_page_shift(h); + if (idx >= size) + goto out; + retry: page = find_lock_page(mapping, idx); if (!page) { - size = i_size_read(mapping->host) >> huge_page_shift(h); - if (idx >= size) - goto out; - /* * Check for page in userfault range */ @@ -3854,9 +3854,6 @@ retry: } ptl = huge_pte_lock(h, mm, ptep); - size = i_size_read(mapping->host) >> huge_page_shift(h); - if (idx >= size) - goto backout; ret = 0; if (!huge_pte_none(huge_ptep_get(ptep))) @@ -3959,8 +3956,10 @@ vm_fault_t hugetlb_fault(struct mm_struc /* * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold - * until finished with ptep. This prevents huge_pmd_unshare from - * being called elsewhere and making the ptep no longer valid. + * until finished with ptep. This serves two purposes: + * 1) It prevents huge_pmd_unshare from being called elsewhere + * and making the ptep no longer valid. + * 2) It synchronizes us with file truncation. * * ptep could have already be assigned via huge_pte_offset. That * is OK, as huge_pte_alloc will return the same value unless _ Patches currently in -mm which might be from mike.kravetz(a)oracle.com are

6 years, 8 months

1
0
0 0

[merged] hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization has been removed from the -mm tree. Its filename was hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Mike Kravetz <mike.kravetz(a)oracle.com> Subject: hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. [mike.kravetz(a)oracle.com: add explicit check for mapping != null] Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com> Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com> Cc: Michal Hocko <mhocko(a)kernel.org> Cc: Hugh Dickins <hughd(a)google.com> Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar(a)linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange(a)redhat.com> Cc: Davidlohr Bueso <dave(a)stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa(a)oracle.com> Cc: Colin Ian King <colin.king(a)canonical.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/mm/hugetlb.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization +++ a/mm/hugetlb.c @@ -3238,6 +3238,7 @@ int copy_hugetlb_page_range(struct mm_st struct page *ptepage; unsigned long addr; int cow; + struct address_space *mapping = vma->vm_file->f_mapping; struct hstate *h = hstate_vma(vma); unsigned long sz = huge_page_size(h); struct mmu_notifier_range range; @@ -3249,13 +3250,23 @@ int copy_hugetlb_page_range(struct mm_st mmu_notifier_range_init(&range, src, vma->vm_start, vma->vm_end); mmu_notifier_invalidate_range_start(&range); + } else { + /* + * For shared mappings i_mmap_rwsem must be held to call + * huge_pte_alloc, otherwise the returned ptep could go + * away if part of a shared pmd and another thread calls + * huge_pmd_unshare. + */ + i_mmap_lock_read(mapping); } for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) { spinlock_t *src_ptl, *dst_ptl; + src_pte = huge_pte_offset(src, addr, sz); if (!src_pte) continue; + dst_pte = huge_pte_alloc(dst, addr, sz); if (!dst_pte) { ret = -ENOMEM; @@ -3326,6 +3337,8 @@ int copy_hugetlb_page_range(struct mm_st if (cow) mmu_notifier_invalidate_range_end(&range); + else + i_mmap_unlock_read(mapping); return ret; } @@ -3771,14 +3784,18 @@ retry: }; /* - * hugetlb_fault_mutex must be dropped before - * handling userfault. Reacquire after handling - * fault to make calling code simpler. + * hugetlb_fault_mutex and i_mmap_rwsem must be + * dropped before handling userfault. Reacquire + * after handling fault to make calling code simpler. */ hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr); mutex_unlock(&hugetlb_fault_mutex_table[hash]); + i_mmap_unlock_read(mapping); + ret = handle_userfault(&vmf, VM_UFFD_MISSING); + + i_mmap_lock_read(mapping); mutex_lock(&hugetlb_fault_mutex_table[hash]); goto out; } @@ -3926,6 +3943,11 @@ vm_fault_t hugetlb_fault(struct mm_struc ptep = huge_pte_offset(mm, haddr, huge_page_size(h)); if (ptep) { + /* + * Since we hold no locks, ptep could be stale. That is + * OK as we are only making decisions based on content and + * not actually modifying content here. + */ entry = huge_ptep_get(ptep); if (unlikely(is_hugetlb_entry_migration(entry))) { migration_entry_wait_huge(vma, mm, ptep); @@ -3933,20 +3955,31 @@ vm_fault_t hugetlb_fault(struct mm_struc } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) return VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); - } else { - ptep = huge_pte_alloc(mm, haddr, huge_page_size(h)); - if (!ptep) - return VM_FAULT_OOM; } + /* + * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold + * until finished with ptep. This prevents huge_pmd_unshare from + * being called elsewhere and making the ptep no longer valid. + * + * ptep could have already be assigned via huge_pte_offset. That + * is OK, as huge_pte_alloc will return the same value unless + * something changed. + */ mapping = vma->vm_file->f_mapping; - idx = vma_hugecache_offset(h, vma, haddr); + i_mmap_lock_read(mapping); + ptep = huge_pte_alloc(mm, haddr, huge_page_size(h)); + if (!ptep) { + i_mmap_unlock_read(mapping); + return VM_FAULT_OOM; + } /* * Serialize hugepage allocation and instantiation, so that we don't * get spurious allocation failures if two CPUs race to instantiate * the same page in the page cache. */ + idx = vma_hugecache_offset(h, vma, haddr); hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr); mutex_lock(&hugetlb_fault_mutex_table[hash]); @@ -4034,6 +4067,7 @@ out_ptl: } out_mutex: mutex_unlock(&hugetlb_fault_mutex_table[hash]); + i_mmap_unlock_read(mapping); /* * Generally it's safe to hold refcount during waiting page lock. But * here we just wait to defer the next page fault to avoid busy loop and @@ -4638,10 +4672,12 @@ void adjust_range_if_pmd_sharing_possibl * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc() * and returns the corresponding pte. While this is not necessary for the * !shared pmd case because we can allocate the pmd later as well, it makes the - * code much cleaner. pmd allocation is essential for the shared case because - * pud has to be populated inside the same i_mmap_rwsem section - otherwise - * racing tasks could either miss the sharing (see huge_pte_offset) or select a - * bad pmd for sharing. + * code much cleaner. + * + * This routine must be called with i_mmap_rwsem held in at least read mode. + * For hugetlbfs, this prevents removal of any page table entries associated + * with the address space. This is important as we are setting up sharing + * based on existing page table entries (mappings). */ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) { @@ -4658,7 +4694,6 @@ pte_t *huge_pmd_share(struct mm_struct * if (!vma_shareable(vma, addr)) return (pte_t *)pmd_alloc(mm, pud, addr); - i_mmap_lock_write(mapping); vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; @@ -4688,7 +4723,6 @@ pte_t *huge_pmd_share(struct mm_struct * spin_unlock(ptl); out: pte = (pte_t *)pmd_alloc(mm, pud, addr); - i_mmap_unlock_write(mapping); return pte; } @@ -4699,7 +4733,7 @@ out: * indicated by page_count > 1, unmap is achieved by clearing pud and * decrementing the ref count. If count == 1, the pte page is not shared. * - * called with page table lock held. + * Called with page table lock held and i_mmap_rwsem held in write mode. * * returns: 1 successfully unmapped a shared pte page * 0 the underlying pte page is not shared, or it is the last user --- a/mm/memory-failure.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization +++ a/mm/memory-failure.c @@ -966,7 +966,7 @@ static bool hwpoison_user_mappings(struc enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS; struct address_space *mapping; LIST_HEAD(tokill); - bool unmap_success; + bool unmap_success = true; int kill = 1, forcekill; struct page *hpage = *hpagep; bool mlocked = PageMlocked(hpage); @@ -1028,7 +1028,19 @@ static bool hwpoison_user_mappings(struc if (kill) collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED); - unmap_success = try_to_unmap(hpage, ttu); + if (!PageHuge(hpage)) { + unmap_success = try_to_unmap(hpage, ttu); + } else if (mapping) { + /* + * For hugetlb pages, try_to_unmap could potentially call + * huge_pmd_unshare. Because of this, take semaphore in + * write mode here and set TTU_RMAP_LOCKED to indicate we + * have taken the lock at this higer level. + */ + i_mmap_lock_write(mapping); + unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED); + i_mmap_unlock_write(mapping); + } if (!unmap_success) pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n", pfn, page_mapcount(hpage)); --- a/mm/migrate.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization +++ a/mm/migrate.c @@ -1324,8 +1324,19 @@ static int unmap_and_move_huge_page(new_ goto put_anon; if (page_mapped(hpage)) { + struct address_space *mapping = page_mapping(hpage); + + /* + * try_to_unmap could potentially call huge_pmd_unshare. + * Because of this, take semaphore in write mode here and + * set TTU_RMAP_LOCKED to let lower levels know we have + * taken the lock. + */ + i_mmap_lock_write(mapping); try_to_unmap(hpage, - TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); + TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS| + TTU_RMAP_LOCKED); + i_mmap_unlock_write(mapping); page_was_mapped = 1; } --- a/mm/rmap.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization +++ a/mm/rmap.c @@ -25,6 +25,7 @@ * page->flags PG_locked (lock_page) * hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share) * mapping->i_mmap_rwsem + * hugetlb_fault_mutex (hugetlbfs specific page fault mutex) * anon_vma->rwsem * mm->page_table_lock or pte_lock * zone_lru_lock (in mark_page_accessed, isolate_lru_page) @@ -1378,6 +1379,9 @@ static bool try_to_unmap_one(struct page /* * If sharing is possible, start and end will be adjusted * accordingly. + * + * If called for a huge page, caller must hold i_mmap_rwsem + * in write mode as it is possible to call huge_pmd_unshare. */ adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end); --- a/mm/userfaultfd.c~hugetlbfs-use-i_mmap_rwsem-for-more-pmd-sharing-synchronization +++ a/mm/userfaultfd.c @@ -267,10 +267,14 @@ retry: VM_BUG_ON(dst_addr & ~huge_page_mask(h)); /* - * Serialize via hugetlb_fault_mutex + * Serialize via i_mmap_rwsem and hugetlb_fault_mutex. + * i_mmap_rwsem ensures the dst_pte remains valid even + * in the case of shared pmds. fault mutex prevents + * races with other faulting threads. */ - idx = linear_page_index(dst_vma, dst_addr); mapping = dst_vma->vm_file->f_mapping; + i_mmap_lock_read(mapping); + idx = linear_page_index(dst_vma, dst_addr); hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping, idx, dst_addr); mutex_lock(&hugetlb_fault_mutex_table[hash]); @@ -279,6 +283,7 @@ retry: dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h)); if (!dst_pte) { mutex_unlock(&hugetlb_fault_mutex_table[hash]); + i_mmap_unlock_read(mapping); goto out_unlock; } @@ -286,6 +291,7 @@ retry: dst_pteval = huge_ptep_get(dst_pte); if (!huge_pte_none(dst_pteval)) { mutex_unlock(&hugetlb_fault_mutex_table[hash]); + i_mmap_unlock_read(mapping); goto out_unlock; } @@ -293,6 +299,7 @@ retry: dst_addr, src_addr, &page); mutex_unlock(&hugetlb_fault_mutex_table[hash]); + i_mmap_unlock_read(mapping); vm_alloc_shared = vm_shared; cond_resched(); _ Patches currently in -mm which might be from mike.kravetz(a)oracle.com are

6 years, 8 months

1
0
0 0

[merged] hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined has been removed from the -mm tree. Its filename was hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Michal Hocko <mhocko(a)suse.com> Subject: hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined We have received a bug report that an injected MCE about faulty memory prevents memory offline to succeed on 4.4 base kernel. The underlying reason was that the HWPoison page has an elevated reference count and the migration keeps failing. There are two problems with that. First of all it is dubious to migrate the poisoned page because we know that accessing that memory is possible to fail. Secondly it doesn't make any sense to migrate a potentially broken content and preserve the memory corruption over to a new location. Oscar has found out that 4.4 and the current upstream kernels behave slightly differently with his simply testcase === int main(void) { int ret; int i; int fd; char *array = malloc(4096); char *array_locked = malloc(4096); fd = open("/tmp/data", O_RDONLY); read(fd, array, 4095); for (i = 0; i < 4096; i++) array_locked[i] = 'd'; ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked)); if (ret) perror("mlock"); sleep (20); ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON); if (ret) perror("madvise"); for (i = 0; i < 4096; i++) array_locked[i] = 'd'; return 0; } === + offline this memory. In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU list kernel: [<ffffffff81019ac9>] dump_trace+0x59/0x340 kernel: [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170 kernel: [<ffffffff8101ac71>] show_stack+0x21/0x40 kernel: [<ffffffff8132bb90>] dump_stack+0x5c/0x7c kernel: [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0 kernel: [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160 kernel: [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100 kernel: [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0 kernel: [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70 kernel: [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs] kernel: [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200 kernel: [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0 kernel: [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660 kernel: [<ffffffff8120e50d>] __vfs_read+0xcd/0x140 kernel: [<ffffffff8120e9ea>] vfs_read+0x7a/0x120 kernel: [<ffffffff8121404b>] kernel_read+0x3b/0x50 kernel: [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0 kernel: [<ffffffff81215f08>] do_execve+0x28/0x30 kernel: [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130 kernel: [<ffffffff8161c045>] ret_from_fork+0x55/0x80 And that latter confuses the hotremove path because an LRU page is attempted to be migrated and that fails due to an elevated reference count. It is quite possible that the reuse of the HWPoisoned page is some kind of fixed race condition but I am not really sure about that. With the upstream kernel the failure is slightly different. The page doesn't seem to have LRU bit set but isolate_movable_page simply fails and do_migrate_range simply puts all the isolated pages back to LRU and therefore no progress is made and scan_movable_pages finds same set of pages over and over again. Fix both cases by explicitly checking HWPoisoned pages before we even try to get reference on the page, try to unmap it if it is still mapped. As explained by Naoya: : Hwpoison code never unmapped those for no big reason because : Ksm pages never dominate memory, so we simply didn't have strong : motivation to save the pages. Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU HWPoison pages which shouldn't happen but I couldn't convince myself about that. Naoya has noted the following: : Theoretically no such gurantee, because try_to_unmap() doesn't have a : guarantee of success and then memory_failure() returns immediately : when hwpoison_user_mappings fails. : Or the following code (comes after hwpoison_user_mappings block) also impli= : es : that the target page can still have PageLRU flag. : : /* : * Torn down by someone else? : */ : if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) { : action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED); : res =3D -EBUSY; : goto out; : } : : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in : current version of your patch. Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko(a)suse.com> Reviewed-by: Oscar Salvador <osalvador(a)suse.com> Debugged-by: Oscar Salvador <osalvador(a)suse.com> Tested-by: Oscar Salvador <osalvador(a)suse.com> Acked-by: David Hildenbrand <david(a)redhat.com> Acked-by: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/mm/memory_hotplug.c~hwpoison-memory_hotplug-allow-hwpoisoned-pages-to-be-offlined +++ a/mm/memory_hotplug.c @@ -34,6 +34,7 @@ #include <linux/hugetlb.h> #include <linux/memblock.h> #include <linux/compaction.h> +#include <linux/rmap.h> #include <asm/tlbflush.h> @@ -1368,6 +1369,21 @@ do_migrate_range(unsigned long start_pfn pfn = page_to_pfn(compound_head(page)) + hpage_nr_pages(page) - 1; + /* + * HWPoison pages have elevated reference counts so the migration would + * fail on them. It also doesn't make any sense to migrate them in the + * first place. Still try to unmap such a page in case it is still mapped + * (e.g. current hwpoison implementation doesn't unmap KSM pages but keep + * the unmap as the catch all safety net). + */ + if (PageHWPoison(page)) { + if (WARN_ON(PageLRU(page))) + isolate_lru_page(page); + if (page_mapped(page)) + try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS); + continue; + } + if (!get_page_unless_zero(page)) continue; /* _ Patches currently in -mm which might be from mhocko(a)suse.com are mm-memcg-fix-reclaim-deadlock-with-writeback.patch

6 years, 8 months

1
0
0 0

[merged] zram-fix-double-free-backing-device.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: zram: fix double free backing device has been removed from the -mm tree. Its filename was zram-fix-double-free-backing-device.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Minchan Kim <minchan(a)kernel.org> Subject: zram: fix double free backing device If blkdev_get fails, we shouldn't do blkdev_put. Otherwise, kernel emits below log. This patch fixes it. [ 31.073006] WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 blkdev_put+0x105/0x120 [ 31.075104] Modules linked in: [ 31.075898] CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453 [ 31.077484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 31.079589] RIP: 0010:blkdev_put+0x105/0x120 [ 31.080606] Code: 48 c7 80 a0 00 00 00 00 00 00 00 48 c7 c7 40 e7 40 96 e8 6e 47 73 00 48 8b bb e0 00 00 00 e9 2c ff ff ff 0f 0b e9 75 ff ff ff <0f> 0b e9 5a ff ff ff 48 c7 80 a0 00 00 00 00 00 00 00 eb 87 0f 1f [ 31.085080] RSP: 0018:ffffb409005c7ed0 EFLAGS: 00010297 [ 31.086383] RAX: ffff9779fe5a8040 RBX: ffff9779fbc17300 RCX: 00000000b9fc37a4 [ 31.088105] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff9640e740 [ 31.089850] RBP: ffff9779fbc17318 R08: ffffffff95499a89 R09: 0000000000000004 [ 31.091201] R10: ffffb409005c7e50 R11: 7a9ef6088ff4d4a1 R12: 0000000000000083 [ 31.092276] R13: ffff9779fe607b98 R14: 0000000000000000 R15: ffff9779fe607a38 [ 31.093355] FS: 00007fc118d9b840(0000) GS:ffff9779fc600000(0000) knlGS:0000000000000000 [ 31.094582] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 31.095541] CR2: 00007fc11894b8dc CR3: 00000000339f6001 CR4: 0000000000160ef0 [ 31.096781] Call Trace: [ 31.097212] __x64_sys_swapoff+0x46d/0x490 [ 31.097914] do_syscall_64+0x5a/0x190 [ 31.098550] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 31.099402] RIP: 0033:0x7fc11843ec27 [ 31.100013] Code: 73 01 c3 48 8b 0d 71 62 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 a8 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 62 2c 00 f7 d8 64 89 01 48 [ 31.103149] RSP: 002b:00007ffdf69be648 EFLAGS: 00000206 ORIG_RAX: 00000000000000a8 [ 31.104425] RAX: ffffffffffffffda RBX: 00000000011d98c0 RCX: 00007fc11843ec27 [ 31.105627] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00000000011d98c0 [ 31.106847] RBP: 0000000000000001 R08: 00007ffdf69be690 R09: 0000000000000001 [ 31.108038] R10: 00000000000002b1 R11: 0000000000000206 R12: 0000000000000001 [ 31.109231] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 31.110433] irq event stamp: 4466 [ 31.111001] hardirqs last enabled at (4465): [<ffffffff953ebd43>] __free_pages_ok+0x1e3/0x490 [ 31.112437] hardirqs last disabled at (4466): [<ffffffff95201b7a>] trace_hardirqs_off_thunk+0x1a/0x1c [ 31.113973] softirqs last enabled at (3420): [<ffffffff95e00333>] __do_softirq+0x333/0x446 [ 31.115364] softirqs last disabled at (3407): [<ffffffff9527aee1>] irq_exit+0xd1/0xe0 Link: http://lkml.kernel.org/r/20181127055429.251614-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan(a)kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky(a)gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas(a)gmail.com> Cc: <stable(a)vger.kernel.org> [4.14+] Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/drivers/block/zram/zram_drv.c~zram-fix-double-free-backing-device +++ a/drivers/block/zram/zram_drv.c @@ -387,8 +387,10 @@ static ssize_t backing_dev_store(struct bdev = bdgrab(I_BDEV(inode)); err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL, zram); - if (err < 0) + if (err < 0) { + bdev = NULL; goto out; + } nr_pages = i_size_read(inode) >> PAGE_SHIFT; bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long); _ Patches currently in -mm which might be from minchan(a)kernel.org are zram-idle-writeback-fixes-and-cleanup.patch

6 years, 8 months

1
0
0 0

[merged] mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL has been removed from the -mm tree. Its filename was mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Dan Williams <dan.j.williams(a)intel.com> Subject: mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL At Maintainer Summit, Greg brought up a topic I proposed around EXPORT_SYMBOL_GPL usage. The motivation was considerations for when EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional step of reclassifying an existing export. Specifically, I wanted to make the case that although the line is fuzzy and hard to specify in abstract terms, it is nonetheless clear that devm_memremap_pages() and HMM (Heterogeneous Memory Management) have crossed it. The devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the beginning, and HMM as a derivative of that functionality should have naturally picked up that designation as well. Contrary to typical rules, the HMM infrastructure was merged upstream with zero in-tree consumers. There was a promise at the time that those users would be merged "soon", but it has been over a year with no drivers arriving. While the Nouveau driver is about to belatedly make good on that promise it is clear that HMM was targeted first and foremost at an out-of-tree consumer. HMM is derived from devm_memremap_pages(), a facility Christoph and I spearheaded to support persistent memory. It combines a device lifetime model with a dynamically created 'struct page' / memmap array for any physical address range. It enables coordination and control of the many code paths in the kernel built to interact with memory via 'struct page' objects. With HMM the integration goes even deeper by allowing device drivers to hook and manipulate page fault and page free events. One interpretation of when EXPORT_SYMBOL is suitable is when it is exporting stable and generic leaf functionality. The devm_memremap_pages() facility continues to see expanding use cases, peer-to-peer DMA being the most recent, with no clear end date when it will stop attracting reworks and semantic changes. It is not suitable to export devm_memremap_pages() as a stable 3rd party driver API due to the fact that it is still changing and manipulates core behavior. Moreover, it is not in the best interest of the long term development of the core memory management subsystem to permit any external driver to effectively define its own system-wide memory management policies with no encouragement to engage with upstream. I am also concerned that HMM was designed in a way to minimize further engagement with the core-MM. That, with these hooks in place, device-drivers are free to implement their own policies without much consideration for whether and how the core-MM could grow to meet that need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the core-MM should be allowed the opportunity and stimulus to change and address these new use cases as first class functionality. Original changelog: hmm_devmem_add(), and hmm_devmem_add_resource() duplicated devm_memremap_pages() and are now simple now wrappers around the core facility to inject a dev_pagemap instance into the global pgmap_radix and hook page-idle events. The devm_memremap_pages() interface is base infrastructure for HMM. HMM has more and deeper ties into the kernel memory management implementation than base ZONE_DEVICE which is itself a EXPORT_SYMBOL_GPL facility. Originally, the HMM page structure creation routines copied the devm_memremap_pages() code and reused ZONE_DEVICE. A cleanup to unify the implementations was discussed during the initial review: http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled this cleanup to move forward. In addition to the integration with devm_memremap_pages() HMM depends on other GPL-only symbols: mmu_notifier_unregister_no_release percpu_ref region_intersects __class_create It goes further to consume / indirectly expose functionality that is not exported to any other driver: alloc_pages_vma walk_page_range HMM is derived from devm_memremap_pages(), and extends deep core-kernel fundamentals. Similar to devm_memremap_pages(), mark its entry points EXPORT_SYMBOL_GPL(). [logang(a)deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()] Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwil… Signed-off-by: Dan Williams <dan.j.williams(a)intel.com> Signed-off-by: Logan Gunthorpe <logang(a)deltatee.com> Reviewed-by: Christoph Hellwig <hch(a)lst.de> Cc: Logan Gunthorpe <logang(a)deltatee.com> Cc: "Jérôme Glisse" <jglisse(a)redhat.com> Cc: Balbir Singh <bsingharora(a)gmail.com>, Cc: Michal Hocko <mhocko(a)suse.com> Cc: Linus Torvalds <torvalds(a)linux-foundation.org> Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/drivers/pci/p2pdma.c~mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl +++ a/drivers/pci/p2pdma.c @@ -82,10 +82,8 @@ static void pci_p2pdma_percpu_release(st complete_all(&p2p->devmap_ref_done); } -static void pci_p2pdma_percpu_kill(void *data) +static void pci_p2pdma_percpu_kill(struct percpu_ref *ref) { - struct percpu_ref *ref = data; - /* * pci_p2pdma_add_resource() may be called multiple times * by a driver and may register the percpu_kill devm action multiple @@ -198,6 +196,7 @@ int pci_p2pdma_add_resource(struct pci_d pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar); + pgmap->kill = pci_p2pdma_percpu_kill; addr = devm_memremap_pages(&pdev->dev, pgmap); if (IS_ERR(addr)) { @@ -211,11 +210,6 @@ int pci_p2pdma_add_resource(struct pci_d if (error) goto pgmap_free; - error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill, - &pdev->p2pdma->devmap_ref); - if (error) - goto pgmap_free; - pci_info(pdev, "added peer-to-peer DMA memory %pR\n", &pgmap->res); --- a/mm/hmm.c~mm-hmm-mark-hmm_devmem_add-add_resource-export_symbol_gpl +++ a/mm/hmm.c @@ -1110,7 +1110,7 @@ struct hmm_devmem *hmm_devmem_add(const return result; return devmem; } -EXPORT_SYMBOL(hmm_devmem_add); +EXPORT_SYMBOL_GPL(hmm_devmem_add); struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, struct device *device, @@ -1164,7 +1164,7 @@ struct hmm_devmem *hmm_devmem_add_resour return result; return devmem; } -EXPORT_SYMBOL(hmm_devmem_add_resource); +EXPORT_SYMBOL_GPL(hmm_devmem_add_resource); /* * A device driver that wants to handle multiple devices memory through a _ Patches currently in -mm which might be from dan.j.williams(a)intel.com are

6 years, 8 months

1
0
0 0

[merged] mm-hmm-replace-hmm_devmem_pages_create-with-devm_memremap_pages.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages() has been removed from the -mm tree. Its filename was mm-hmm-replace-hmm_devmem_pages_create-with-devm_memremap_pages.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Dan Williams <dan.j.williams(a)intel.com> Subject: mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages() e8d513483300 "memremap: change devm_memremap_pages interface to use struct dev_pagemap" refactored devm_memremap_pages() to allow a dev_pagemap instance to be supplied. Passing in a dev_pagemap interface simplifies the design of pgmap type drivers in that they can rely on container_of() to lookup any private data associated with the given dev_pagemap instance. In addition to the cleanups this also gives hmm users multi-order-radix improvements that arrived with commit ab1b597ee0e4 "mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups" As part of the conversion to the devm_memremap_pages() method of handling the percpu_ref relative to when pages are put, the percpu_ref completion needs to move to hmm_devmem_ref_exit(). See 71389703839e ("mm, zone_device: Replace {get, put}_zone_device_page...") for details. Link: http://lkml.kernel.org/r/154275560053.76910.10870962637383152392.stgit@dwil… Signed-off-by: Dan Williams <dan.j.williams(a)intel.com> Reviewed-by: Christoph Hellwig <hch(a)lst.de> Reviewed-by: Jérôme Glisse <jglisse(a)redhat.com> Acked-by: Balbir Singh <bsingharora(a)gmail.com> Cc: Logan Gunthorpe <logang(a)deltatee.com> Cc: Michal Hocko <mhocko(a)suse.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/mm/hmm.c~mm-hmm-replace-hmm_devmem_pages_create-with-devm_memremap_pages +++ a/mm/hmm.c @@ -986,17 +986,16 @@ static void hmm_devmem_ref_exit(void *da struct hmm_devmem *devmem; devmem = container_of(ref, struct hmm_devmem, ref); + wait_for_completion(&devmem->completion); percpu_ref_exit(ref); } -static void hmm_devmem_ref_kill(void *data) +static void hmm_devmem_ref_kill(struct percpu_ref *ref) { - struct percpu_ref *ref = data; struct hmm_devmem *devmem; devmem = container_of(ref, struct hmm_devmem, ref); percpu_ref_kill(ref); - wait_for_completion(&devmem->completion); } static int hmm_devmem_fault(struct vm_area_struct *vma, @@ -1019,154 +1018,6 @@ static void hmm_devmem_free(struct page devmem->ops->free(devmem, page); } -static DEFINE_MUTEX(hmm_devmem_lock); -static RADIX_TREE(hmm_devmem_radix, GFP_KERNEL); - -static void hmm_devmem_radix_release(struct resource *resource) -{ - resource_size_t key; - - mutex_lock(&hmm_devmem_lock); - for (key = resource->start; - key <= resource->end; - key += PA_SECTION_SIZE) - radix_tree_delete(&hmm_devmem_radix, key >> PA_SECTION_SHIFT); - mutex_unlock(&hmm_devmem_lock); -} - -static void hmm_devmem_release(void *data) -{ - struct hmm_devmem *devmem = data; - struct resource *resource = devmem->resource; - unsigned long start_pfn, npages; - struct zone *zone; - struct page *page; - - /* pages are dead and unused, undo the arch mapping */ - start_pfn = (resource->start & ~(PA_SECTION_SIZE - 1)) >> PAGE_SHIFT; - npages = ALIGN(resource_size(resource), PA_SECTION_SIZE) >> PAGE_SHIFT; - - page = pfn_to_page(start_pfn); - zone = page_zone(page); - - mem_hotplug_begin(); - if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) - __remove_pages(zone, start_pfn, npages, NULL); - else - arch_remove_memory(start_pfn << PAGE_SHIFT, - npages << PAGE_SHIFT, NULL); - mem_hotplug_done(); - - hmm_devmem_radix_release(resource); -} - -static int hmm_devmem_pages_create(struct hmm_devmem *devmem) -{ - resource_size_t key, align_start, align_size, align_end; - struct device *device = devmem->device; - int ret, nid, is_ram; - - align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1); - align_size = ALIGN(devmem->resource->start + - resource_size(devmem->resource), - PA_SECTION_SIZE) - align_start; - - is_ram = region_intersects(align_start, align_size, - IORESOURCE_SYSTEM_RAM, - IORES_DESC_NONE); - if (is_ram == REGION_MIXED) { - WARN_ONCE(1, "%s attempted on mixed region %pr\n", - __func__, devmem->resource); - return -ENXIO; - } - if (is_ram == REGION_INTERSECTS) - return -ENXIO; - - if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY) - devmem->pagemap.type = MEMORY_DEVICE_PUBLIC; - else - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; - - devmem->pagemap.res = *devmem->resource; - devmem->pagemap.page_fault = hmm_devmem_fault; - devmem->pagemap.page_free = hmm_devmem_free; - devmem->pagemap.dev = devmem->device; - devmem->pagemap.ref = &devmem->ref; - devmem->pagemap.data = devmem; - - mutex_lock(&hmm_devmem_lock); - align_end = align_start + align_size - 1; - for (key = align_start; key <= align_end; key += PA_SECTION_SIZE) { - struct hmm_devmem *dup; - - dup = radix_tree_lookup(&hmm_devmem_radix, - key >> PA_SECTION_SHIFT); - if (dup) { - dev_err(device, "%s: collides with mapping for %s\n", - __func__, dev_name(dup->device)); - mutex_unlock(&hmm_devmem_lock); - ret = -EBUSY; - goto error; - } - ret = radix_tree_insert(&hmm_devmem_radix, - key >> PA_SECTION_SHIFT, - devmem); - if (ret) { - dev_err(device, "%s: failed: %d\n", __func__, ret); - mutex_unlock(&hmm_devmem_lock); - goto error_radix; - } - } - mutex_unlock(&hmm_devmem_lock); - - nid = dev_to_node(device); - if (nid < 0) - nid = numa_mem_id(); - - mem_hotplug_begin(); - /* - * For device private memory we call add_pages() as we only need to - * allocate and initialize struct page for the device memory. More- - * over the device memory is un-accessible thus we do not want to - * create a linear mapping for the memory like arch_add_memory() - * would do. - * - * For device public memory, which is accesible by the CPU, we do - * want the linear mapping and thus use arch_add_memory(). - */ - if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC) - ret = arch_add_memory(nid, align_start, align_size, NULL, - false); - else - ret = add_pages(nid, align_start >> PAGE_SHIFT, - align_size >> PAGE_SHIFT, NULL, false); - if (ret) { - mem_hotplug_done(); - goto error_add_memory; - } - move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], - align_start >> PAGE_SHIFT, - align_size >> PAGE_SHIFT, NULL); - mem_hotplug_done(); - - /* - * Initialization of the pages has been deferred until now in order - * to allow us to do the work while not holding the hotplug lock. - */ - memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], - align_start >> PAGE_SHIFT, - align_size >> PAGE_SHIFT, &devmem->pagemap); - - return 0; - -error_add_memory: - untrack_pfn(NULL, PHYS_PFN(align_start), align_size); -error_radix: - hmm_devmem_radix_release(devmem->resource); -error: - return ret; -} - /* * hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory * @@ -1190,6 +1041,7 @@ struct hmm_devmem *hmm_devmem_add(const { struct hmm_devmem *devmem; resource_size_t addr; + void *result; int ret; dev_pagemap_get_ops(); @@ -1244,14 +1096,18 @@ struct hmm_devmem *hmm_devmem_add(const devmem->pfn_last = devmem->pfn_first + (resource_size(devmem->resource) >> PAGE_SHIFT); - ret = hmm_devmem_pages_create(devmem); - if (ret) - return ERR_PTR(ret); - - ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem); - if (ret) - return ERR_PTR(ret); + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; + devmem->pagemap.res = *devmem->resource; + devmem->pagemap.page_fault = hmm_devmem_fault; + devmem->pagemap.page_free = hmm_devmem_free; + devmem->pagemap.altmap_valid = false; + devmem->pagemap.ref = &devmem->ref; + devmem->pagemap.data = devmem; + devmem->pagemap.kill = hmm_devmem_ref_kill; + result = devm_memremap_pages(devmem->device, &devmem->pagemap); + if (IS_ERR(result)) + return result; return devmem; } EXPORT_SYMBOL(hmm_devmem_add); @@ -1261,6 +1117,7 @@ struct hmm_devmem *hmm_devmem_add_resour struct resource *res) { struct hmm_devmem *devmem; + void *result; int ret; if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY) @@ -1293,19 +1150,18 @@ struct hmm_devmem *hmm_devmem_add_resour devmem->pfn_last = devmem->pfn_first + (resource_size(devmem->resource) >> PAGE_SHIFT); - ret = hmm_devmem_pages_create(devmem); - if (ret) - return ERR_PTR(ret); - - ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem); - if (ret) - return ERR_PTR(ret); - - ret = devm_add_action_or_reset(device, hmm_devmem_ref_kill, - &devmem->ref); - if (ret) - return ERR_PTR(ret); + devmem->pagemap.type = MEMORY_DEVICE_PUBLIC; + devmem->pagemap.res = *devmem->resource; + devmem->pagemap.page_fault = hmm_devmem_fault; + devmem->pagemap.page_free = hmm_devmem_free; + devmem->pagemap.altmap_valid = false; + devmem->pagemap.ref = &devmem->ref; + devmem->pagemap.data = devmem; + devmem->pagemap.kill = hmm_devmem_ref_kill; + result = devm_memremap_pages(devmem->device, &devmem->pagemap); + if (IS_ERR(result)) + return result; return devmem; } EXPORT_SYMBOL(hmm_devmem_add_resource); _ Patches currently in -mm which might be from dan.j.williams(a)intel.com are

6 years, 8 months

1
0
0 0

[merged] mm-hmm-use-devm-semantics-for-hmm_devmem_add-remove.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm, hmm: use devm semantics for hmm_devmem_{add, remove} has been removed from the -mm tree. Its filename was mm-hmm-use-devm-semantics-for-hmm_devmem_add-remove.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Dan Williams <dan.j.williams(a)intel.com> Subject: mm, hmm: use devm semantics for hmm_devmem_{add, remove} devm semantics arrange for resources to be torn down when device-driver-probe fails or when device-driver-release completes. Similar to devm_memremap_pages() there is no need to support an explicit remove operation when the users properly adhere to devm semantics. Note that devm_kzalloc() automatically handles allocating node-local memory. Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwill… Signed-off-by: Dan Williams <dan.j.williams(a)intel.com> Reviewed-by: Christoph Hellwig <hch(a)lst.de> Reviewed-by: Jérôme Glisse <jglisse(a)redhat.com> Cc: "Jérôme Glisse" <jglisse(a)redhat.com> Cc: Logan Gunthorpe <logang(a)deltatee.com> Cc: Balbir Singh <bsingharora(a)gmail.com> Cc: Michal Hocko <mhocko(a)suse.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/include/linux/hmm.h~mm-hmm-use-devm-semantics-for-hmm_devmem_add-remove +++ a/include/linux/hmm.h @@ -512,8 +512,7 @@ struct hmm_devmem { * enough and allocate struct page for it. * * The device driver can wrap the hmm_devmem struct inside a private device - * driver struct. The device driver must call hmm_devmem_remove() before the - * device goes away and before freeing the hmm_devmem struct memory. + * driver struct. */ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, struct device *device, @@ -521,7 +520,6 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, struct device *device, struct resource *res); -void hmm_devmem_remove(struct hmm_devmem *devmem); /* * hmm_devmem_page_set_drvdata - set per-page driver data field --- a/mm/hmm.c~mm-hmm-use-devm-semantics-for-hmm_devmem_add-remove +++ a/mm/hmm.c @@ -987,7 +987,6 @@ static void hmm_devmem_ref_exit(void *da devmem = container_of(ref, struct hmm_devmem, ref); percpu_ref_exit(ref); - devm_remove_action(devmem->device, &hmm_devmem_ref_exit, data); } static void hmm_devmem_ref_kill(void *data) @@ -998,7 +997,6 @@ static void hmm_devmem_ref_kill(void *da devmem = container_of(ref, struct hmm_devmem, ref); percpu_ref_kill(ref); wait_for_completion(&devmem->completion); - devm_remove_action(devmem->device, &hmm_devmem_ref_kill, data); } static int hmm_devmem_fault(struct vm_area_struct *vma, @@ -1036,7 +1034,7 @@ static void hmm_devmem_radix_release(str mutex_unlock(&hmm_devmem_lock); } -static void hmm_devmem_release(struct device *dev, void *data) +static void hmm_devmem_release(void *data) { struct hmm_devmem *devmem = data; struct resource *resource = devmem->resource; @@ -1044,11 +1042,6 @@ static void hmm_devmem_release(struct de struct zone *zone; struct page *page; - if (percpu_ref_tryget_live(&devmem->ref)) { - dev_WARN(dev, "%s: page mapping is still live!\n", __func__); - percpu_ref_put(&devmem->ref); - } - /* pages are dead and unused, undo the arch mapping */ start_pfn = (resource->start & ~(PA_SECTION_SIZE - 1)) >> PAGE_SHIFT; npages = ALIGN(resource_size(resource), PA_SECTION_SIZE) >> PAGE_SHIFT; @@ -1174,19 +1167,6 @@ error: return ret; } -static int hmm_devmem_match(struct device *dev, void *data, void *match_data) -{ - struct hmm_devmem *devmem = data; - - return devmem->resource == match_data; -} - -static void hmm_devmem_pages_remove(struct hmm_devmem *devmem) -{ - devres_release(devmem->device, &hmm_devmem_release, - &hmm_devmem_match, devmem->resource); -} - /* * hmm_devmem_add() - hotplug ZONE_DEVICE memory for device memory * @@ -1214,8 +1194,7 @@ struct hmm_devmem *hmm_devmem_add(const dev_pagemap_get_ops(); - devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem), - GFP_KERNEL, dev_to_node(device)); + devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL); if (!devmem) return ERR_PTR(-ENOMEM); @@ -1229,11 +1208,11 @@ struct hmm_devmem *hmm_devmem_add(const ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release, 0, GFP_KERNEL); if (ret) - goto error_percpu_ref; + return ERR_PTR(ret); - ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref); + ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit, &devmem->ref); if (ret) - goto error_devm_add_action; + return ERR_PTR(ret); size = ALIGN(size, PA_SECTION_SIZE); addr = min((unsigned long)iomem_resource.end, @@ -1253,16 +1232,12 @@ struct hmm_devmem *hmm_devmem_add(const devmem->resource = devm_request_mem_region(device, addr, size, dev_name(device)); - if (!devmem->resource) { - ret = -ENOMEM; - goto error_no_resource; - } + if (!devmem->resource) + return ERR_PTR(-ENOMEM); break; } - if (!devmem->resource) { - ret = -ERANGE; - goto error_no_resource; - } + if (!devmem->resource) + return ERR_PTR(-ERANGE); devmem->resource->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY; devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT; @@ -1271,28 +1246,13 @@ struct hmm_devmem *hmm_devmem_add(const ret = hmm_devmem_pages_create(devmem); if (ret) - goto error_pages; - - devres_add(device, devmem); + return ERR_PTR(ret); - ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref); - if (ret) { - hmm_devmem_remove(devmem); + ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem); + if (ret) return ERR_PTR(ret); - } return devmem; - -error_pages: - devm_release_mem_region(device, devmem->resource->start, - resource_size(devmem->resource)); -error_no_resource: -error_devm_add_action: - hmm_devmem_ref_kill(&devmem->ref); - hmm_devmem_ref_exit(&devmem->ref); -error_percpu_ref: - devres_free(devmem); - return ERR_PTR(ret); } EXPORT_SYMBOL(hmm_devmem_add); @@ -1308,8 +1268,7 @@ struct hmm_devmem *hmm_devmem_add_resour dev_pagemap_get_ops(); - devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem), - GFP_KERNEL, dev_to_node(device)); + devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL); if (!devmem) return ERR_PTR(-ENOMEM); @@ -1323,12 +1282,12 @@ struct hmm_devmem *hmm_devmem_add_resour ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release, 0, GFP_KERNEL); if (ret) - goto error_percpu_ref; + return ERR_PTR(ret); - ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref); + ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit, + &devmem->ref); if (ret) - goto error_devm_add_action; - + return ERR_PTR(ret); devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT; devmem->pfn_last = devmem->pfn_first + @@ -1336,60 +1295,22 @@ struct hmm_devmem *hmm_devmem_add_resour ret = hmm_devmem_pages_create(devmem); if (ret) - goto error_devm_add_action; + return ERR_PTR(ret); - devres_add(device, devmem); + ret = devm_add_action_or_reset(device, hmm_devmem_release, devmem); + if (ret) + return ERR_PTR(ret); - ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref); - if (ret) { - hmm_devmem_remove(devmem); + ret = devm_add_action_or_reset(device, hmm_devmem_ref_kill, + &devmem->ref); + if (ret) return ERR_PTR(ret); - } return devmem; - -error_devm_add_action: - hmm_devmem_ref_kill(&devmem->ref); - hmm_devmem_ref_exit(&devmem->ref); -error_percpu_ref: - devres_free(devmem); - return ERR_PTR(ret); } EXPORT_SYMBOL(hmm_devmem_add_resource); /* - * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE) - * - * @devmem: hmm_devmem struct use to track and manage the ZONE_DEVICE memory - * - * This will hot-unplug memory that was hotplugged by hmm_devmem_add on behalf - * of the device driver. It will free struct page and remove the resource that - * reserved the physical address range for this device memory. - */ -void hmm_devmem_remove(struct hmm_devmem *devmem) -{ - resource_size_t start, size; - struct device *device; - bool cdm = false; - - if (!devmem) - return; - - device = devmem->device; - start = devmem->resource->start; - size = resource_size(devmem->resource); - - cdm = devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY; - hmm_devmem_ref_kill(&devmem->ref); - hmm_devmem_ref_exit(&devmem->ref); - hmm_devmem_pages_remove(devmem); - - if (!cdm) - devm_release_mem_region(device, start, size); -} -EXPORT_SYMBOL(hmm_devmem_remove); - -/* * A device driver that wants to handle multiple devices memory through a * single fake device can use hmm_device to do so. This is purely a helper * and it is not needed to make use of any HMM functionality. _ Patches currently in -mm which might be from dan.j.williams(a)intel.com are

6 years, 8 months

1
0
0 0

[merged] mm-devm_memremap_pages-add-memory_device_private-support.patch removed from -mm tree

by akpm＠linux-foundation.org

The patch titled Subject: mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support has been removed from the -mm tree. Its filename was mm-devm_memremap_pages-add-memory_device_private-support.patch This patch was dropped because it was merged into mainline or a subsystem tree ------------------------------------------------------ From: Dan Williams <dan.j.williams(a)intel.com> Subject: mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support In preparation for consolidating all ZONE_DEVICE enabling via devm_memremap_pages(), teach it how to handle the constraints of MEMORY_DEVICE_PRIVATE ranges. [jglisse(a)redhat.com: call move_pfn_range_to_zone for MEMORY_DEVICE_PRIVATE] Link: http://lkml.kernel.org/r/154275559036.76910.12434636179931292607.stgit@dwil… Signed-off-by: Dan Williams <dan.j.williams(a)intel.com> Reviewed-by: Jérôme Glisse <jglisse(a)redhat.com> Acked-by: Christoph Hellwig <hch(a)lst.de> Reported-by: Logan Gunthorpe <logang(a)deltatee.com> Reviewed-by: Logan Gunthorpe <logang(a)deltatee.com> Cc: Balbir Singh <bsingharora(a)gmail.com> Cc: Michal Hocko <mhocko(a)suse.com> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- --- a/kernel/memremap.c~mm-devm_memremap_pages-add-memory_device_private-support +++ a/kernel/memremap.c @@ -98,9 +98,15 @@ static void devm_memremap_pages_release( - align_start; mem_hotplug_begin(); - arch_remove_memory(align_start, align_size, pgmap->altmap_valid ? - &pgmap->altmap : NULL); - kasan_remove_zero_shadow(__va(align_start), align_size); + if (pgmap->type == MEMORY_DEVICE_PRIVATE) { + pfn = align_start >> PAGE_SHIFT; + __remove_pages(page_zone(pfn_to_page(pfn)), pfn, + align_size >> PAGE_SHIFT, NULL); + } else { + arch_remove_memory(align_start, align_size, + pgmap->altmap_valid ? &pgmap->altmap : NULL); + kasan_remove_zero_shadow(__va(align_start), align_size); + } mem_hotplug_done(); untrack_pfn(NULL, PHYS_PFN(align_start), align_size); @@ -187,17 +193,40 @@ void *devm_memremap_pages(struct device goto err_pfn_remap; mem_hotplug_begin(); - error = kasan_add_zero_shadow(__va(align_start), align_size); - if (error) { - mem_hotplug_done(); - goto err_kasan; + + /* + * For device private memory we call add_pages() as we only need to + * allocate and initialize struct page for the device memory. More- + * over the device memory is un-accessible thus we do not want to + * create a linear mapping for the memory like arch_add_memory() + * would do. + * + * For all other device memory types, which are accessible by + * the CPU, we do want the linear mapping and thus use + * arch_add_memory(). + */ + if (pgmap->type == MEMORY_DEVICE_PRIVATE) { + error = add_pages(nid, align_start >> PAGE_SHIFT, + align_size >> PAGE_SHIFT, NULL, false); + } else { + error = kasan_add_zero_shadow(__va(align_start), align_size); + if (error) { + mem_hotplug_done(); + goto err_kasan; + } + + error = arch_add_memory(nid, align_start, align_size, altmap, + false); + } + + if (!error) { + struct zone *zone; + + zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE]; + move_pfn_range_to_zone(zone, align_start >> PAGE_SHIFT, + align_size >> PAGE_SHIFT, altmap); } - error = arch_add_memory(nid, align_start, align_size, altmap, false); - if (!error) - move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], - align_start >> PAGE_SHIFT, - align_size >> PAGE_SHIFT, altmap); mem_hotplug_done(); if (error) goto err_add_memory; _ Patches currently in -mm which might be from dan.j.williams(a)intel.com are

6 years, 8 months

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror