July 2023 - Linux-stable-mirror

[PATCH v2] Revert "usb: dwc3: core: Enable AutoRetry feature in the controller"

by Jakub Vanek

This reverts commit b138e23d3dff90c0494925b4c1874227b81bddf7. AutoRetry has been found to cause some issues. This feature allows the controller in host mode (further referred to as the xHC) to send non-terminating/burst retry ACKs (Retry=1 and Nump!=0) instead of terminating retry ACKs (Retry=1 and Nump=0) to devices when a transaction error occurs. Unfortunately, some USB devices fail to retry transactions when the xHC sends them a burst retry ACK. When this happens, the xHC enters a strange state. After the affected transfer times out, the xHCI driver tries to resume normal operation of the xHC by sending it a Stop Endpoint command. However, the xHC fails to respond to it, and the xHCI driver gives up. [1] This fact is reported via dmesg: [sda] tag#29 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN [sda] tag#29 CDB: opcode=0x28 28 00 00 69 42 80 00 00 48 00 xhci-hcd: xHCI host not responding to stop endpoint command xhci-hcd: xHCI host controller not responding, assume dead xhci-hcd: HC died; cleaning up Some users observed this problem on an Odroid HC2 with the JMS578 USB3-to-SATA bridge. The issue can be triggered by starting a read-heavy workload on an attached SSD. After a while, the host controller would die and the SSD would disappear from the system. [1] Further analysis by Synopsys determined that controller revisions other than the one in Odroid HC2 are also affected by this. The recommended solution was to disable AutoRetry altogether. This change does not have a noticeable performance impact. [2] Fixes: b138e23d3dff ("usb: dwc3: core: Enable AutoRetry feature in the controller") Link: https://lore.kernel.org/r/a21f34c04632d250cd0a78c7c6f4a1c9c7a43142.camel@gm… [1] Link: https://lore.kernel.org/r/20230711214834.kyr6ulync32d4ktk@synopsys.com/ [2] Cc: stable(a)vger.kernel.org Cc: Mauro Ribeiro <mauro.ribeiro(a)hardkernel.com> Cc: Krzysztof Kozlowski <krzysztof.kozlowski(a)linaro.org> Suggested-by: Thinh Nguyen <Thinh.Nguyen(a)synopsys.com> Signed-off-by: Jakub Vanek <linuxtardis(a)gmail.com> --- V1 -> V2: Updated to disable AutoRetry everywhere based on Synopsys feedback Reworded the changelog a bit to make it clearer drivers/usb/dwc3/core.c | 16 ---------------- drivers/usb/dwc3/core.h | 3 --- 2 files changed, 19 deletions(-) diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c index f6689b731718..a4e079d37566 100644 --- a/drivers/usb/dwc3/core.c +++ b/drivers/usb/dwc3/core.c @@ -1209,22 +1209,6 @@ static int dwc3_core_init(struct dwc3 *dwc) dwc3_writel(dwc->regs, DWC3_GUCTL1, reg); } - if (dwc->dr_mode == USB_DR_MODE_HOST || - dwc->dr_mode == USB_DR_MODE_OTG) { - reg = dwc3_readl(dwc->regs, DWC3_GUCTL); - - /* - * Enable Auto retry Feature to make the controller operating in - * Host mode on seeing transaction errors(CRC errors or internal - * overrun scenerios) on IN transfers to reply to the device - * with a non-terminating retry ACK (i.e, an ACK transcation - * packet with Retry=1 & Nump != 0) - */ - reg |= DWC3_GUCTL_HSTINAUTORETRY; - - dwc3_writel(dwc->regs, DWC3_GUCTL, reg); - } - /* * Must config both number of packets and max burst settings to enable * RX and/or TX threshold. diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h index 8b1295e4dcdd..a69ac67d89fe 100644 --- a/drivers/usb/dwc3/core.h +++ b/drivers/usb/dwc3/core.h @@ -256,9 +256,6 @@ #define DWC3_GCTL_GBLHIBERNATIONEN BIT(1) #define DWC3_GCTL_DSBLCLKGTNG BIT(0) -/* Global User Control Register */ -#define DWC3_GUCTL_HSTINAUTORETRY BIT(14) - /* Global User Control 1 Register */ #define DWC3_GUCTL1_DEV_DECOUPLE_L1L2_EVT BIT(31) #define DWC3_GUCTL1_TX_IPGAP_LINECHECK_DIS BIT(28) -- 2.25.1

2 years

3
4
0 0

[PATCH] hwmon: (aquacomputer_d5next) Fix incorrect PWM value readout

by Aleksa Savic

Commit 662d20b3a5af ("hwmon: (aquacomputer_d5next) Add support for temperature sensor offsets") changed aqc_get_ctrl_val() to return the value through a parameter instead of through the return value, but didn't fix up a case that relied on the old behavior. Fix it to use the proper received value and not the return code. Fixes: 662d20b3a5af ("hwmon: (aquacomputer_d5next) Add support for temperature sensor offsets") Cc: stable(a)vger.kernel.org Signed-off-by: Aleksa Savic <savicaleksa83(a)gmail.com> --- drivers/hwmon/aquacomputer_d5next.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/hwmon/aquacomputer_d5next.c b/drivers/hwmon/aquacomputer_d5next.c index a981f7086114..a997dbcb563f 100644 --- a/drivers/hwmon/aquacomputer_d5next.c +++ b/drivers/hwmon/aquacomputer_d5next.c @@ -1027,7 +1027,7 @@ static int aqc_read(struct device *dev, enum hwmon_sensor_types type, u32 attr, if (ret < 0) return ret; - *val = aqc_percent_to_pwm(ret); + *val = aqc_percent_to_pwm(*val); break; } break; -- 2.35.0

2 years

1
0
0 0

[PATCH v4 04/11] PCI/ASPM: Use RMW accessors for changing LNKCTL

by Ilpo Järvinen

Don't assume that the device is fully under the control of ASPM and use RMW capability accessors which do proper locking to avoid losing concurrent updates to the register values. If configuration fails in pcie_aspm_configure_common_clock(), the function attempts to restore the old PCI_EXP_LNKCTL_CCC settings. Store only the old PCI_EXP_LNKCTL_CCC bit for the relevant devices rather than the content of the whole LNKCTL registers. It aligns better with how pcie_lnkctl_clear_and_set() expects its parameter and makes the code more obvious to understand. Fixes: 2a42d9dba784 ("PCIe: ASPM: Break out of endless loop waiting for PCI config bits to switch") Fixes: 7d715a6c1ae5 ("PCI: add PCI Express ASPM support") Suggested-by: Lukas Wunner <lukas(a)wunner.de> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com> Acked-by: Rafael J. Wysocki <rafael(a)kernel.org> Cc: stable(a)vger.kernel.org --- drivers/pci/pcie/aspm.c | 31 ++++++++++++++----------------- 1 file changed, 14 insertions(+), 17 deletions(-) diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c index 3dafba0b5f41..207c247cba02 100644 --- a/drivers/pci/pcie/aspm.c +++ b/drivers/pci/pcie/aspm.c @@ -199,7 +199,7 @@ static void pcie_clkpm_cap_init(struct pcie_link_state *link, int blacklist) static void pcie_aspm_configure_common_clock(struct pcie_link_state *link) { int same_clock = 1; - u16 reg16, parent_reg, child_reg[8]; + u16 reg16, parent_old_ccc, child_old_ccc[8]; struct pci_dev *child, *parent = link->pdev; struct pci_bus *linkbus = parent->subordinate; /* @@ -221,6 +221,7 @@ static void pcie_aspm_configure_common_clock(struct pcie_link_state *link) /* Port might be already in common clock mode */ pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16); + parent_old_ccc = reg16 & PCI_EXP_LNKCTL_CCC; if (same_clock && (reg16 & PCI_EXP_LNKCTL_CCC)) { bool consistent = true; @@ -240,31 +241,27 @@ static void pcie_aspm_configure_common_clock(struct pcie_link_state *link) /* Configure downstream component, all functions */ list_for_each_entry(child, &linkbus->devices, bus_list) { pcie_capability_read_word(child, PCI_EXP_LNKCTL, &reg16); - child_reg[PCI_FUNC(child->devfn)] = reg16; - if (same_clock) - reg16 |= PCI_EXP_LNKCTL_CCC; - else - reg16 &= ~PCI_EXP_LNKCTL_CCC; - pcie_capability_write_word(child, PCI_EXP_LNKCTL, reg16); + child_old_ccc[PCI_FUNC(child->devfn)] = reg16 & PCI_EXP_LNKCTL_CCC; + pcie_capability_clear_and_set_word(child, PCI_EXP_LNKCTL, + PCI_EXP_LNKCTL_CCC, + same_clock ? PCI_EXP_LNKCTL_CCC : 0); } /* Configure upstream component */ - pcie_capability_read_word(parent, PCI_EXP_LNKCTL, &reg16); - parent_reg = reg16; - if (same_clock) - reg16 |= PCI_EXP_LNKCTL_CCC; - else - reg16 &= ~PCI_EXP_LNKCTL_CCC; - pcie_capability_write_word(parent, PCI_EXP_LNKCTL, reg16); + pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL, + PCI_EXP_LNKCTL_CCC, + same_clock ? PCI_EXP_LNKCTL_CCC : 0); if (pcie_retrain_link(link->pdev, true)) { /* Training failed. Restore common clock configurations */ pci_err(parent, "ASPM: Could not configure common clock\n"); list_for_each_entry(child, &linkbus->devices, bus_list) - pcie_capability_write_word(child, PCI_EXP_LNKCTL, - child_reg[PCI_FUNC(child->devfn)]); - pcie_capability_write_word(parent, PCI_EXP_LNKCTL, parent_reg); + pcie_capability_clear_and_set_word(child, PCI_EXP_LNKCTL, + PCI_EXP_LNKCTL_CCC, + child_old_ccc[PCI_FUNC(child->devfn)]); + pcie_capability_clear_and_set_word(parent, PCI_EXP_LNKCTL, + PCI_EXP_LNKCTL_CCC, parent_old_ccc); } } -- 2.30.2

2 years

2
1
0 0

[PATCH v4 03/11] PCI: pciehp: Use RMW accessors for changing LNKCTL

by Ilpo Järvinen

As hotplug is not the only driver touching LNKCTL, use the RMW capability accessor which handles concurrent changes correctly. Fixes: 7f822999e12a ("PCI: pciehp: Add Disable/enable link functions") Suggested-by: Lukas Wunner <lukas(a)wunner.de> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com> Acked-by: Rafael J. Wysocki <rafael(a)kernel.org> Cc: stable(a)vger.kernel.org --- drivers/pci/hotplug/pciehp_hpc.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 8711325605f0..e9ec77d8d44a 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -332,17 +332,11 @@ int pciehp_check_link_status(struct controller *ctrl) static int __pciehp_link_set(struct controller *ctrl, bool enable) { struct pci_dev *pdev = ctrl_dev(ctrl); - u16 lnk_ctrl; - pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnk_ctrl); + pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL, + PCI_EXP_LNKCTL_LD, + !enable ? PCI_EXP_LNKCTL_LD : 0); - if (enable) - lnk_ctrl &= ~PCI_EXP_LNKCTL_LD; - else - lnk_ctrl |= PCI_EXP_LNKCTL_LD; - - pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnk_ctrl); - ctrl_dbg(ctrl, "%s: lnk_ctrl = %x\n", __func__, lnk_ctrl); return 0; } -- 2.30.2

2 years

2
1
0 0

[PATCH 1/3] PM / wakeirq: fix wake irq arming

by Johan Hovold

The decision whether to enable a wake irq during suspend can not be done based on the runtime PM state directly as a driver may use wake irqs without implementing runtime PM. Such drivers specifically leave the state set to the default 'suspended' and the wake irq is thus never enabled at suspend. Add a new wake irq flag to track whether a dedicated wake irq has been enabled at runtime suspend and therefore must not be enabled at system suspend. Note that pm_runtime_enabled() can not be used as runtime PM is always disabled during late suspend. Fixes: 69728051f5bf ("PM / wakeirq: Fix unbalanced IRQ enable for wakeirq") Cc: stable(a)vger.kernel.org # 4.16 Cc: Tony Lindgren <tony(a)atomide.com> Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org> --- drivers/base/power/power.h | 1 + drivers/base/power/wakeirq.c | 12 ++++++++---- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/base/power/power.h b/drivers/base/power/power.h index 0eb7f02b3ad5..922ed457db19 100644 --- a/drivers/base/power/power.h +++ b/drivers/base/power/power.h @@ -29,6 +29,7 @@ extern u64 pm_runtime_active_time(struct device *dev); #define WAKE_IRQ_DEDICATED_MASK (WAKE_IRQ_DEDICATED_ALLOCATED | \ WAKE_IRQ_DEDICATED_MANAGED | \ WAKE_IRQ_DEDICATED_REVERSE) +#define WAKE_IRQ_DEDICATED_ENABLED BIT(3) struct wake_irq { struct device *dev; diff --git a/drivers/base/power/wakeirq.c b/drivers/base/power/wakeirq.c index d487a6bac630..afd094dec5ca 100644 --- a/drivers/base/power/wakeirq.c +++ b/drivers/base/power/wakeirq.c @@ -314,8 +314,10 @@ void dev_pm_enable_wake_irq_check(struct device *dev, return; enable: - if (!can_change_status || !(wirq->status & WAKE_IRQ_DEDICATED_REVERSE)) + if (!can_change_status || !(wirq->status & WAKE_IRQ_DEDICATED_REVERSE)) { enable_irq(wirq->irq); + wirq->status |= WAKE_IRQ_DEDICATED_ENABLED; + } } /** @@ -336,8 +338,10 @@ void dev_pm_disable_wake_irq_check(struct device *dev, bool cond_disable) if (cond_disable && (wirq->status & WAKE_IRQ_DEDICATED_REVERSE)) return; - if (wirq->status & WAKE_IRQ_DEDICATED_MANAGED) + if (wirq->status & WAKE_IRQ_DEDICATED_MANAGED) { + wirq->status &= ~WAKE_IRQ_DEDICATED_ENABLED; disable_irq_nosync(wirq->irq); + } } /** @@ -376,7 +380,7 @@ void dev_pm_arm_wake_irq(struct wake_irq *wirq) if (device_may_wakeup(wirq->dev)) { if (wirq->status & WAKE_IRQ_DEDICATED_ALLOCATED && - !pm_runtime_status_suspended(wirq->dev)) + !(wirq->status & WAKE_IRQ_DEDICATED_ENABLED)) enable_irq(wirq->irq); enable_irq_wake(wirq->irq); @@ -399,7 +403,7 @@ void dev_pm_disarm_wake_irq(struct wake_irq *wirq) disable_irq_wake(wirq->irq); if (wirq->status & WAKE_IRQ_DEDICATED_ALLOCATED && - !pm_runtime_status_suspended(wirq->dev)) + !(wirq->status & WAKE_IRQ_DEDICATED_ENABLED)) disable_irq_nosync(wirq->irq); } } -- 2.41.0

2 years

2
1
0 0

[PATCH 3/3] serial: qcom-geni: drop bogus runtime pm state update

by Johan Hovold

The runtime PM state should not be changed by drivers that do not implement runtime PM even if it happens to work around a bug in PM core. With the wake irq arming now fixed, drop the bogus runtime PM state update which left the device in active state (and could potentially prevent a parent device from suspending). Fixes: f3974413cf02 ("tty: serial: qcom_geni_serial: Wakeup IRQ cleanup") Cc: stable(a)vger.kernel.org # 5.6 Cc: Matthias Kaehlcke <mka(a)chromium.org> Cc: Stephen Boyd <swboyd(a)chromium.org> Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org> --- drivers/tty/serial/qcom_geni_serial.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c index 88ed5bbe25a8..b825b05e6137 100644 --- a/drivers/tty/serial/qcom_geni_serial.c +++ b/drivers/tty/serial/qcom_geni_serial.c @@ -1681,13 +1681,6 @@ static int qcom_geni_serial_probe(struct platform_device *pdev) if (ret) return ret; - /* - * Set pm_runtime status as ACTIVE so that wakeup_irq gets - * enabled/disabled from dev_pm_arm_wake_irq during system - * suspend/resume respectively. - */ - pm_runtime_set_active(&pdev->dev); - if (port->wakeup_irq > 0) { device_init_wakeup(&pdev->dev, true); ret = dev_pm_set_dedicated_wake_irq(&pdev->dev, -- 2.41.0

2 years

2
1
0 0

[PATCH 1/2] KVM: Avoid illegal stage2 mapping on invalid memory slot

by tien.sung.ang＠intel.com

From: Gavin Shan <gshan(a)redhat.com> We run into guest hang in edk2 firmware when KSM is kept as running on the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash device (TYPE_PFLASH_CFI01) during the operation of sector erasing or buffered write. The status is returned by reading the memory region of the pflash device and the read request should have been forwarded to QEMU and emulated by it. Unfortunately, the read request is covered by an illegal stage2 mapping when the guest hang issue occurs. The read request is completed with QEMU bypassed and wrong status is fetched. The edk2 firmware runs into an infinite loop with the wrong status. The illegal stage2 mapping is populated due to same page sharing by KSM at (C) even the associated memory slot has been marked as invalid at (B) when the memory slot is requested to be deleted. It's notable that the active and inactive memory slots can't be swapped when we're in the middle of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count is elevated, and kvm_swap_active_memslots() will busy loop until it reaches to zero again. Besides, the swapping from the active to the inactive memory slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(), corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots(). CPU-A CPU-B ----- ----- ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION) kvm_vm_ioctl_set_memory_region kvm_set_memory_region __kvm_set_memory_region kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE) kvm_invalidate_memslot kvm_copy_memslot kvm_replace_memslot kvm_swap_active_memslots (A) kvm_arch_flush_shadow_memslot (B) same page sharing by KSM kvm_mmu_notifier_invalidate_range_start : kvm_mmu_notifier_change_pte kvm_handle_hva_range __kvm_handle_hva_range kvm_set_spte_gfn (C) : kvm_mmu_notifier_invalidate_range_end Fix the issue by skipping the invalid memory slot at (C) to avoid the illegal stage2 mapping so that the read request for the pflash's status is forwarded to QEMU and emulated by it. In this way, the correct pflash's status can be returned from QEMU to break the infinite loop in the edk2 firmware. We tried a git-bisect and the first problematic commit is cd4c71835228 (" KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this, clean_dcache_guest_page() is called after the memory slots are iterated in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called before the iteration on the memory slots before this commit. This change literally enlarges the racy window between kvm_mmu_notifier_change_pte() and memory slot removal so that we're able to reproduce the issue in a practical test case. However, the issue exists since commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup"). Cc: stable(a)vger.kernel.org # v3.9+ Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") Reported-by: Shuai Hu <hshuai(a)redhat.com> Reported-by: Zhenyu Zhang <zhenyzha(a)redhat.com> Signed-off-by: Gavin Shan <gshan(a)redhat.com> Reviewed-by: David Hildenbrand <david(a)redhat.com> Reviewed-by: Oliver Upton <oliver.upton(a)linux.dev> Reviewed-by: Peter Xu <peterx(a)redhat.com> Reviewed-by: Sean Christopherson <seanjc(a)google.com> Reviewed-by: Shaoqin Huang <shahuang(a)redhat.com> Message-Id: <20230615054259.14911-1-gshan(a)redhat.com> Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com> --- virt/kvm/kvm_main.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 479802a892d4..65f94f592ff8 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -686,6 +686,24 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn return __kvm_handle_hva_range(kvm, &range); } + +static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) +{ + /* + * Skipping invalid memslots is correct if and only change_pte() is + * surrounded by invalidate_range_{start,end}(), which is currently + * guaranteed by the primary MMU. If that ever changes, KVM needs to + * unmap the memslot instead of skipping the memslot to ensure that KVM + * doesn't hold references to the old PFN. + */ + WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); + + if (range->slot->flags & KVM_MEMSLOT_INVALID) + return false; + + return kvm_set_spte_gfn(kvm, range); +} + static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long address, @@ -707,7 +725,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, if (!READ_ONCE(kvm->mmu_invalidate_in_progress)) return; - kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); + kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn); } void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start, -- 2.25.1

2 years

2
1
0 0

[PATCH 1/2] KVM: Avoid illegal stage2 mapping on invalid memory slot

by tien.sung.ang＠intel.com

From: Gavin Shan <gshan(a)redhat.com> We run into guest hang in edk2 firmware when KSM is kept as running on the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash device (TYPE_PFLASH_CFI01) during the operation of sector erasing or buffered write. The status is returned by reading the memory region of the pflash device and the read request should have been forwarded to QEMU and emulated by it. Unfortunately, the read request is covered by an illegal stage2 mapping when the guest hang issue occurs. The read request is completed with QEMU bypassed and wrong status is fetched. The edk2 firmware runs into an infinite loop with the wrong status. The illegal stage2 mapping is populated due to same page sharing by KSM at (C) even the associated memory slot has been marked as invalid at (B) when the memory slot is requested to be deleted. It's notable that the active and inactive memory slots can't be swapped when we're in the middle of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count is elevated, and kvm_swap_active_memslots() will busy loop until it reaches to zero again. Besides, the swapping from the active to the inactive memory slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(), corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots(). CPU-A CPU-B ----- ----- ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION) kvm_vm_ioctl_set_memory_region kvm_set_memory_region __kvm_set_memory_region kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE) kvm_invalidate_memslot kvm_copy_memslot kvm_replace_memslot kvm_swap_active_memslots (A) kvm_arch_flush_shadow_memslot (B) same page sharing by KSM kvm_mmu_notifier_invalidate_range_start : kvm_mmu_notifier_change_pte kvm_handle_hva_range __kvm_handle_hva_range kvm_set_spte_gfn (C) : kvm_mmu_notifier_invalidate_range_end Fix the issue by skipping the invalid memory slot at (C) to avoid the illegal stage2 mapping so that the read request for the pflash's status is forwarded to QEMU and emulated by it. In this way, the correct pflash's status can be returned from QEMU to break the infinite loop in the edk2 firmware. We tried a git-bisect and the first problematic commit is cd4c71835228 (" KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this, clean_dcache_guest_page() is called after the memory slots are iterated in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called before the iteration on the memory slots before this commit. This change literally enlarges the racy window between kvm_mmu_notifier_change_pte() and memory slot removal so that we're able to reproduce the issue in a practical test case. However, the issue exists since commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup"). Cc: stable(a)vger.kernel.org # v3.9+ Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") Reported-by: Shuai Hu <hshuai(a)redhat.com> Reported-by: Zhenyu Zhang <zhenyzha(a)redhat.com> Signed-off-by: Gavin Shan <gshan(a)redhat.com> Reviewed-by: David Hildenbrand <david(a)redhat.com> Reviewed-by: Oliver Upton <oliver.upton(a)linux.dev> Reviewed-by: Peter Xu <peterx(a)redhat.com> Reviewed-by: Sean Christopherson <seanjc(a)google.com> Reviewed-by: Shaoqin Huang <shahuang(a)redhat.com> Message-Id: <20230615054259.14911-1-gshan(a)redhat.com> Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com> --- virt/kvm/kvm_main.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 479802a892d4..65f94f592ff8 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -686,6 +686,24 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn return __kvm_handle_hva_range(kvm, &range); } + +static bool kvm_change_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range) +{ + /* + * Skipping invalid memslots is correct if and only change_pte() is + * surrounded by invalidate_range_{start,end}(), which is currently + * guaranteed by the primary MMU. If that ever changes, KVM needs to + * unmap the memslot instead of skipping the memslot to ensure that KVM + * doesn't hold references to the old PFN. + */ + WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); + + if (range->slot->flags & KVM_MEMSLOT_INVALID) + return false; + + return kvm_set_spte_gfn(kvm, range); +} + static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, struct mm_struct *mm, unsigned long address, @@ -707,7 +725,7 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, if (!READ_ONCE(kvm->mmu_invalidate_in_progress)) return; - kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); + kvm_handle_hva_range(mn, address, address + 1, pte, kvm_change_spte_gfn); } void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start, -- 2.25.1

2 years

2
1
0 0

[RFC PATCH 2/3] memfd: remove racheting feature from vm.memfd_noexec

by Aleksa Sarai

This sysctl has the very unusal behaviour of not allowing any user (even CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were to set this sysctl to a more restrictive option in the host pidns you would need to reboot your machine in order to reset it. The justification given in [1] is that this is a security feature and thus it should not be possible to disable. Aside from the fact that we have plenty of security-related sysctls that can be disabled after being enabled (fs.protected_symlinks for instance), the protection provided by the sysctl is to stop users from being able to create a binary and then execute it. A user with CAP_SYS_ADMIN can trivially do this without memfd_create(2): % cat mount-memfd.c #include <fcntl.h> #include <string.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <linux/mount.h> #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:" int main(void) { int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC); assert(fsfd >= 0); assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2)); int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0); assert(dfd >= 0); int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782); assert(execfd >= 0); assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE)); assert(!close(execfd)); char *execpath = NULL; char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL }; execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC); assert(execfd >= 0); assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0); assert(!execve(execpath, argv, envp)); } % ./mount-memfd this file was executed from this totally private tmpfs: /proc/self/fd/5 % Given that it is possible for CAP_SYS_ADMIN users to create executable binaries without memfd_create(2) and without touching the host filesystem (not to mention the many other things a CAP_SYS_ADMIN process would be able to do that would be equivalent or worse), it seems strange to cause a fair amount of headache to admins when there doesn't appear to be an actual security benefit to blocking this. It should be noted that with this change, programs that can do an unprivileged unshare(CLONE_NEWUSER) would be able to create an executable memfd even if their current pidns didn't allow it. However, the same sample program above can also be used in this scenario, meaning that even with this consideration, blocking CAP_SYS_ADMIN makes little sense: % unshare -rm ./mount-memfd this file was executed from this totally private tmpfs: /proc/self/fd/5 This simply further reinforces that locked-down environments need to disallow CLONE_NEWUSER for unprivileged users (as is already the case in most container environments). [1]: https://lore.kernel.org/all/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5T… Cc: Dominique Martinet <asmadeus(a)codewreck.org> Cc: Christian Brauner <brauner(a)kernel.org> Cc: stable(a)vger.kernel.org # v6.3+ Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Signed-off-by: Aleksa Sarai <cyphar(a)cyphar.com> --- kernel/pid_sysctl.h | 7 ------- 1 file changed, 7 deletions(-) diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h index b26e027fc9cd..8a22bc29ebb4 100644 --- a/kernel/pid_sysctl.h +++ b/kernel/pid_sysctl.h @@ -24,13 +24,6 @@ static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, if (ns != &init_pid_ns) table_copy.data = &ns->memfd_noexec_scope; - /* - * set minimum to current value, the effect is only bigger - * value is accepted. - */ - if (*(int *)table_copy.data > *(int *)table_copy.extra1) - table_copy.extra1 = table_copy.data; - return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos); } -- 2.41.0

2 years

1
1
0 0

[to-be-updated] hugetlb-optimize-update_and_free_pages_bulk-to-avoid-lock-cycles.patch removed from -mm tree

by Andrew Morton

The quilt patch titled Subject: hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles has been removed from the -mm tree. Its filename was hugetlb-optimize-update_and_free_pages_bulk-to-avoid-lock-cycles.patch This patch was dropped because an updated version will be merged ------------------------------------------------------ From: Mike Kravetz <mike.kravetz(a)oracle.com> Subject: hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles Date: Tue, 11 Jul 2023 15:09:42 -0700 update_and_free_pages_bulk is designed to free a list of hugetlb pages back to their associated lower level allocators. This may require allocating vmemmap pages associated with each hugetlb page. The hugetlb page destructor must be changed before pages are freed to lower level allocators. However, the destructor must be changed under the hugetlb lock. This means there is potentially one lock cycle per page. Minimize the number of lock cycles in update_and_free_pages_bulk by: 1) allocating necessary vmemmap for all hugetlb pages on the list 2) take hugetlb lock and clear destructor for all pages on the list 3) free all pages on list back to low level allocators Link: https://lkml.kernel.org/r/20230711220942.43706-3-mike.kravetz@oracle.com Fixes: ad2fa3717b74 ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page") Signed-off-by: Mike Kravetz <mike.kravetz(a)oracle.com> Cc: Axel Rasmussen <axelrasmussen(a)google.com> Cc: James Houghton <jthoughton(a)google.com> Cc: Jiaqi Yan <jiaqiyan(a)google.com> Cc: Miaohe Lin <linmiaohe(a)huawei.com> Cc: Michal Hocko <mhocko(a)suse.com> Cc: Muchun Song <songmuchun(a)bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi(a)linux.dev> Cc: <stable(a)vger.kernel.org> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> --- mm/hugetlb.c | 35 ++++++++++++++++++++++++++++++++++- 1 file changed, 34 insertions(+), 1 deletion(-) --- a/mm/hugetlb.c~hugetlb-optimize-update_and_free_pages_bulk-to-avoid-lock-cycles +++ a/mm/hugetlb.c @@ -1855,11 +1855,44 @@ static void update_and_free_pages_bulk(s { struct page *page, *t_page; struct folio *folio; + bool clear_dtor = false; + /* + * First allocate required vmemmmap for all pages on list. If vmemmap + * can not be allocated, we can not free page to lower level allocator, + * so add back as hugetlb surplus page. + */ + list_for_each_entry_safe(page, t_page, list, lru) { + if (HPageVmemmapOptimized(page)) { + clear_dtor = true; + if (hugetlb_vmemmap_restore(h, page)) { + spin_lock_irq(&hugetlb_lock); + add_hugetlb_folio(h, folio, true); + spin_unlock_irq(&hugetlb_lock); + } + cond_resched(); + } + } + + /* + * If vmemmmap allocation performed above, then take lock * to clear + * destructor of all pages on list. + */ + if (clear_dtor) { + spin_lock_irq(&hugetlb_lock); + list_for_each_entry(page, list, lru) + __clear_hugetlb_destructor(h, page_folio(page)); + spin_unlock_irq(&hugetlb_lock); + } + + /* + * Free pages back to low level allocators. vmemmap and destructors + * were taken care of above, so update_and_free_hugetlb_folio will + * not need to take hugetlb lock. + */ list_for_each_entry_safe(page, t_page, list, lru) { folio = page_folio(page); update_and_free_hugetlb_folio(h, folio, false); - cond_resched(); } } _ Patches currently in -mm which might be from mike.kravetz(a)oracle.com are hugetlb-do-not-clear-hugetlb-dtor-until-allocating-vmemmap.patch

2 years

1
0
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror July 2023