From: Bjorn Helgaas bhelgaas@google.com
This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
Michael reported that when attempting to resume from suspend to RAM on ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay with no output, followed by a reboot.
Workarounds include:
- Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
- Booting with "pcie_aspm=off"
- Booting with "pcie_aspm.policy=performance"
- "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm" before suspending
- Connecting a USB flash drive
Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
Reported-by: Michael Schaller michael@5challer.de
Link: https://lore.kernel.org/r/76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de
Signed-off-by: Bjorn Helgaas bhelgaas@google.com
Cc: stable@vger.kernel.org
---
 drivers/pci/pci.c       |  6 ++++++
 drivers/pci/pci.h       |  2 ++
 drivers/pci/pcie/aspm.c | 19 +++++++++++++++++++
 3 files changed, 27 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 55bc3576a985..bdbf8a94b4d0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1335,6 +1335,9 @@ static int pci_set_full_power_state(struct pci_dev *dev)
 		pci_restore_bars(dev);
 	}

+	if (dev->bus->self)
+		pcie_aspm_pm_state_change(dev->bus->self);
+
 	return 0;
 }

@@ -1429,6 +1432,9 @@ static int pci_set_low_power_state(struct pci_dev *dev, pci_power_t state)
 				     pci_power_name(dev->current_state),
 				     pci_power_name(state));

+	if (dev->bus->self)
+		pcie_aspm_pm_state_change(dev->bus->self);
+
 	return 0;
 }

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5ecbcf041179..f43873049d52 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -569,10 +569,12 @@ int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
 #ifdef CONFIG_PCIEASPM
 void pcie_aspm_init_link_state(struct pci_dev *pdev);
 void pcie_aspm_exit_link_state(struct pci_dev *pdev);
+void pcie_aspm_pm_state_change(struct pci_dev *pdev);
 void pcie_aspm_powersave_config_link(struct pci_dev *pdev);
 #else
 static inline void pcie_aspm_init_link_state(struct pci_dev *pdev) { }
 static inline void pcie_aspm_exit_link_state(struct pci_dev *pdev) { }
+static inline void pcie_aspm_pm_state_change(struct pci_dev *pdev) { }
 static inline void pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
 #endif

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 50b04ae5c394..8715e951c491 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -1008,6 +1008,25 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev)
 	up_read(&pci_bus_sem);
 }

+/* @pdev: the root port or switch downstream port */
+void pcie_aspm_pm_state_change(struct pci_dev *pdev)
+{
+	struct pcie_link_state *link = pdev->link_state;
+
+	if (aspm_disabled || !link)
+		return;
+	/*
+	 * Devices changed PM state, we should recheck if latency
+	 * meets all functions' requirement
+	 */
+	down_read(&pci_bus_sem);
+	mutex_lock(&aspm_lock);
+	pcie_update_aspm_capable(link->root);
+	pcie_config_aspm_path(link);
+	mutex_unlock(&aspm_lock);
+	up_read(&pci_bus_sem);
+}
+
 void pcie_aspm_powersave_config_link(struct pci_dev *pdev)
 {
 	struct pcie_link_state *link = pdev->link_state;
On 1/2/2024 3:25 PM, Bjorn Helgaas wrote:
From: Bjorn Helgaas bhelgaas@google.com
This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
Michael reported that when attempting to resume from suspend to RAM on ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay with no output, followed by a reboot.
Workarounds include:
- Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
- Booting with "pcie_aspm=off"
- Booting with "pcie_aspm.policy=performance"
- "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm" before suspending
- Connecting a USB flash drive
Did you find the root cause? Is this issue specific to that particular device? If yes, can we do a quirk?
On Tue, Jan 02, 2024 at 03:33:51PM -0800, Kuppuswamy Sathyanarayanan wrote:
On 1/2/2024 3:25 PM, Bjorn Helgaas wrote:
From: Bjorn Helgaas bhelgaas@google.com
This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
Michael reported that when attempting to resume from suspend to RAM on ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay with no output, followed by a reboot.
Workarounds include:
- Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
- Booting with "pcie_aspm=off"
- Booting with "pcie_aspm.policy=performance"
- "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm" before suspending
- Connecting a USB flash drive
Did you find the root cause? Is this issue specific to that particular device? If yes, can we do a quirk?
Unfortunately we don't know the root cause yet. Without knowing the root cause, I don't think we can make a good quirk.
Bjorn
Hi Bjorn,
On Tue, Jan 02, 2024 at 05:25:50PM -0600, Bjorn Helgaas wrote:
From: Bjorn Helgaas bhelgaas@google.com
This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
Michael reported that when attempting to resume from suspend to RAM on ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay with no output, followed by a reboot.
Workarounds include:
- Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
- Booting with "pcie_aspm=off"
- Booting with "pcie_aspm.policy=performance"
- "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm" before suspending
- Connecting a USB flash drive
Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
Reported-by: Michael Schaller michael@5challer.de
Link: https://lore.kernel.org/r/76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de
Signed-off-by: Bjorn Helgaas bhelgaas@google.com
Cc: stable@vger.kernel.org
+/* @pdev: the root port or switch downstream port */
+void pcie_aspm_pm_state_change(struct pci_dev *pdev)
+{
+	struct pcie_link_state *link = pdev->link_state;
+
+	if (aspm_disabled || !link)
+		return;
+	/*
+	 * Devices changed PM state, we should recheck if latency
+	 * meets all functions' requirement
+	 */
+	down_read(&pci_bus_sem);
+	mutex_lock(&aspm_lock);
+	pcie_update_aspm_capable(link->root);
+	pcie_config_aspm_path(link);
+	mutex_unlock(&aspm_lock);
+	up_read(&pci_bus_sem);
+}
This function is now restored in 6.7 final and is called in paths which already hold the pci_bus_sem as reported by lockdep (see splat below).
This can potentially lead to a deadlock and specifically prevents using lockdep on Qualcomm platforms.
Not sure if you want to propagate whether the bus semaphore is held to pcie_aspm_pm_state_change() or if there was some alternative to restoring this function which should be explored instead.
Johan
============================================
WARNING: possible recursive locking detected
6.7.0 #40 Not tainted
--------------------------------------------
kworker/u16:5/90 is trying to acquire lock:
ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pcie_aspm_pm_state_change+0x58/0xdc
pcieport 0002:00:00.0: PME: Signaling with IRQ 197

but task is already holding lock:
ffffacfa78ced000
pcieport 0002:00:00.0: AER: enabled with IRQ 197
 (pci_bus_sem
nvme nvme0: pci function 0002:01:00.0
){++++}-{3:3}
nvme 0002:01:00.0: enabling device (0000 -> 0002)
, at: pci_walk_bus+0x34/0xbc

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(pci_bus_sem);
  lock(pci_bus_sem);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by kworker/u16:5/90:
 #0: ffff06c5c0008d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
 #1: ffff800081c0bdd0 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
 #2: ffff06c5c0b7d0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach_async_helper+0x3c/0xf4
 #3: ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc

stack backtrace:
CPU: 1 PID: 90 Comm: kworker/u16:5 Not tainted 6.7.0 #40
Hardware name: LENOVO 21BYZ9SRUS/21BYZ9SRUS, BIOS N3HET53W (1.25 ) 10/12/2022
Workqueue: events_unbound async_run_entry_fn
Call trace:
 dump_backtrace+0x9c/0x11c
 show_stack+0x18/0x24
 dump_stack_lvl+0x60/0xac
 dump_stack+0x18/0x24
 print_deadlock_bug+0x25c/0x348
 __lock_acquire+0x10a4/0x2064
 lock_acquire+0x1e8/0x318
 down_read+0x60/0x184
 pcie_aspm_pm_state_change+0x58/0xdc
 pci_set_full_power_state+0xa8/0x114
 pci_set_power_state+0xc4/0x120
 qcom_pcie_enable_aspm+0x1c/0x3c [pcie_qcom]
 pci_walk_bus+0x64/0xbc
 qcom_pcie_host_post_init_2_7_0+0x28/0x34 [pcie_qcom]
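The splat shows pci_bus_sem being taken for read in pci_walk_bus() and then again, in the same task, in pcie_aspm_pm_state_change(). The sketch below is a user-space toy model of why that nesting is risky. It is not the kernel's rwsem code: the toy_* names are hypothetical, and the model assumes only that new readers have to wait once a writer is queued. Under that assumption the nested read goes through when no writer shows up, but blocks behind the writer (which is itself waiting for the outer read to be released) when one arrives in the window.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of a reader/writer semaphore (hypothetical, not kernel code).
 * Assumed rule: once a writer is waiting, new readers are not admitted.
 */
struct toy_rwsem {
	int active_readers;   /* readers currently holding the lock */
	bool writer_waiting;  /* a writer is queued behind those readers */
};

/* Returns true if the read lock is granted, false if the caller would block. */
static bool toy_down_read(struct toy_rwsem *sem)
{
	if (sem->writer_waiting)
		return false;
	sem->active_readers++;
	return true;
}

/* Drops one read hold; the caller must actually hold one. */
static void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->active_readers > 0);
	sem->active_readers--;
}

/* A writer arrives; it has to wait for as long as any reader holds the lock. */
static void toy_writer_arrives(struct toy_rwsem *sem)
{
	if (sem->active_readers > 0)
		sem->writer_waiting = true;
}

/*
 * Replays the reported sequence: an outer read (the bus walk), optionally a
 * writer arriving, then a nested read (the ASPM update in the callback).
 * Returns true if the nested read would block, i.e. the task would wait on
 * a writer that is in turn waiting on the task's own outer read.
 */
static bool toy_nested_read_blocks(bool writer_in_window)
{
	struct toy_rwsem sem = { 0, false };
	bool outer = toy_down_read(&sem);
	bool nested;

	assert(outer);
	(void)outer;

	if (writer_in_window)
		toy_writer_arrives(&sem);

	nested = toy_down_read(&sem);
	if (nested)
		toy_up_read(&sem);
	toy_up_read(&sem);

	return !nested;
}
```

In this model toy_nested_read_blocks(false) returns false and toy_nested_read_blocks(true) returns true, which matches the report that the problem only bites when something else contends for the lock inside the window.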
Hi Bjorn,
I never got a reply to this one so resending with updated Subject in case it got buried in your inbox.
On Mon, Jan 08, 2024 at 09:39:07AM +0100, Johan Hovold wrote:
On Tue, Jan 02, 2024 at 05:25:50PM -0600, Bjorn Helgaas wrote:
From: Bjorn Helgaas bhelgaas@google.com
This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
Michael reported that when attempting to resume from suspend to RAM on ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay with no output, followed by a reboot.
Workarounds include:
- Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
- Booting with "pcie_aspm=off"
- Booting with "pcie_aspm.policy=performance"
- "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm" before suspending
- Connecting a USB flash drive
Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
Reported-by: Michael Schaller michael@5challer.de
Link: https://lore.kernel.org/r/76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de
Signed-off-by: Bjorn Helgaas bhelgaas@google.com
Cc: stable@vger.kernel.org
+/* @pdev: the root port or switch downstream port */
+void pcie_aspm_pm_state_change(struct pci_dev *pdev)
+{
+	struct pcie_link_state *link = pdev->link_state;
+
+	if (aspm_disabled || !link)
+		return;
+	/*
+	 * Devices changed PM state, we should recheck if latency
+	 * meets all functions' requirement
+	 */
+	down_read(&pci_bus_sem);
+	mutex_lock(&aspm_lock);
+	pcie_update_aspm_capable(link->root);
+	pcie_config_aspm_path(link);
+	mutex_unlock(&aspm_lock);
+	up_read(&pci_bus_sem);
+}
This function is now restored in 6.7 final and is called in paths which already hold the pci_bus_sem as reported by lockdep (see splat below).
This can potentially lead to a deadlock and specifically prevents using lockdep on Qualcomm platforms.
Not sure if you want to propagate whether the bus semaphore is held to pcie_aspm_pm_state_change() or if there was some alternative to restoring this function which should be explored instead.
So to summarise, this patch, which is now commit
f93e71aea6c6 ("Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"")
introduced a regression in 6.7-final for Qualcomm platforms (and some Intel platforms) similar to the one recently fixed by commit
f352ce999260 ("PCI: qcom: Fix potential deadlock when enabling ASPM").
Johan
#regzbot introduced: f93e71aea6c6
============================================
WARNING: possible recursive locking detected
6.7.0 #40 Not tainted
--------------------------------------------
kworker/u16:5/90 is trying to acquire lock:
ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pcie_aspm_pm_state_change+0x58/0xdc
pcieport 0002:00:00.0: PME: Signaling with IRQ 197

but task is already holding lock:
ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(pci_bus_sem);
  lock(pci_bus_sem);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by kworker/u16:5/90:
 #0: ffff06c5c0008d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
 #1: ffff800081c0bdd0 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
 #2: ffff06c5c0b7d0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach_async_helper+0x3c/0xf4
 #3: ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc

stack backtrace:
CPU: 1 PID: 90 Comm: kworker/u16:5 Not tainted 6.7.0 #40
Hardware name: LENOVO 21BYZ9SRUS/21BYZ9SRUS, BIOS N3HET53W (1.25 ) 10/12/2022
Workqueue: events_unbound async_run_entry_fn
Call trace:
 dump_backtrace+0x9c/0x11c
 show_stack+0x18/0x24
 dump_stack_lvl+0x60/0xac
 dump_stack+0x18/0x24
 print_deadlock_bug+0x25c/0x348
 __lock_acquire+0x10a4/0x2064
 lock_acquire+0x1e8/0x318
 down_read+0x60/0x184
 pcie_aspm_pm_state_change+0x58/0xdc
 pci_set_full_power_state+0xa8/0x114
 pci_set_power_state+0xc4/0x120
 qcom_pcie_enable_aspm+0x1c/0x3c [pcie_qcom]
 pci_walk_bus+0x64/0xbc
 qcom_pcie_host_post_init_2_7_0+0x28/0x34 [pcie_qcom]
On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
Hi Bjorn,
I never got a reply to this one so resending with updated Subject in case it got buried in your inbox.
I did see it but decided it was better to fix the problem with resume causing an unintended reboot, even though fixing that meant breaking lockdep again, since I don't think we have user reports of the potential deadlock lockdep finds.
08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a start at fixing other problems and also improving the ASPM style, so I hope somebody steps up to fix both it and the lockdep issue. I haven't looked at it enough to have a preference for *how* to fix it.
Bjorn
On Mon, Jan 22, 2024 at 12:26:15PM -0600, Bjorn Helgaas wrote:
On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
I never got a reply to this one so resending with updated Subject in case it got buried in your inbox.
I did see it but decided it was better to fix the problem with resume causing an unintended reboot, even though fixing that meant breaking lockdep again, since I don't think we have user reports of the potential deadlock lockdep finds.
That may be because I fixed the previous regression in 6.7-rc1 before any users had a chance to hit the deadlock on Qualcomm platforms.
I can easily trigger a deadlock on the X13s by instrumenting 6.7-final with a delay to increase the race window.
And any user hitting this occasionally is likely not going to be able to track it down to this lock inversion (unless they have lockdep enabled).
08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a start at fixing other problems and also improving the ASPM style, so I hope somebody steps up to fix both it and the lockdep issue. I haven't looked at it enough to have a preference for *how* to fix it.
Ok, but since you were the one introducing the locking regression in 6.7-final shouldn't you look into fixing it?
Especially if there were alternatives to restoring the offending commit which would solve the underlying issue for the resume failure without breaking other platforms.
I don't want to spend more time on this if the offending commit could simply be reverted.
Johan
On Tue, Jan 23, 2024 at 06:25:52PM +0100, Johan Hovold wrote:
On Mon, Jan 22, 2024 at 12:26:15PM -0600, Bjorn Helgaas wrote:
On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
I never got a reply to this one so resending with updated Subject in case it got buried in your inbox.
I did see it but decided it was better to fix the problem with resume causing an unintended reboot, even though fixing that meant breaking lockdep again, since I don't think we have user reports of the potential deadlock lockdep finds.
That may be because I fixed the previous regression in 6.7-rc1 before any users had a chance to hit the deadlock on Qualcomm platforms.
I can easily trigger a deadlock on the X13s by instrumenting 6.7-final with a delay to increase the race window.
And any user hitting this occasionally is likely not going to be able to track it down to this lock inversion (unless they have lockdep enabled).
I agree, it's a problem we need to fix.
08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a start at fixing other problems and also improving the ASPM style, so I hope somebody steps up to fix both it and the lockdep issue. I haven't looked at it enough to have a preference for *how* to fix it.
Ok, but since you were the one introducing the locking regression in 6.7-final shouldn't you look into fixing it?
Especially if there were alternatives to restoring the offending commit which would solve the underlying issue for the resume failure without breaking other platforms.
Did somebody propose an alternate patch? If so, I missed it, but we could look at it now.
I don't want to spend more time on this if the offending commit could simply be reverted.
I don't quite follow. By simply reverting, do you mean to revert f93e71aea6c6 ("Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"")? IIUC that would break Michael's machine again.
Bjorn
On Tue, Jan 23, 2024 at 04:36:48PM -0600, Bjorn Helgaas wrote:
On Tue, Jan 23, 2024 at 06:25:52PM +0100, Johan Hovold wrote:
On Mon, Jan 22, 2024 at 12:26:15PM -0600, Bjorn Helgaas wrote:
On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a start at fixing other problems and also improving the ASPM style, so I hope somebody steps up to fix both it and the lockdep issue. I haven't looked at it enough to have a preference for *how* to fix it.
Ok, but since you were the one introducing the locking regression in 6.7-final shouldn't you look into fixing it?
Especially if there were alternatives to restoring the offending commit which would solve the underlying issue for the resume failure without breaking other platforms.
Did somebody propose an alternate patch? If so, I missed it, but we could look at it now.
I've only skimmed the discussion leading up to the revert, but I got the impression that other alternatives were looked at as it was still not clear what the underlying issue actually was.
As Michael and Thorsten pointed out before the revert, it may have been better not to do a last minute revert of a 16 month old commit which risks introducing regressions (and brought back another sysfs issue IIUC) before fully understanding what is really going on here.
I don't want to spend more time on this if the offending commit could simply be reverted.
I don't quite follow. By simply reverting, do you mean to revert f93e71aea6c6 ("Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"")? IIUC that would break Michael's machine again.
Right, at least until that issue is fully understood and alternative fixes have been considered.
If that's not an option, we need to rework core to pass a flag through more than one layer to indicate whether pcie_aspm_pm_state_change() should take the bus semaphore or not. I'd rather not do that if it can be avoided.
Johan
On Wed, Jan 24, 2024 at 09:16:38AM +0100, Johan Hovold wrote:
On Tue, Jan 23, 2024 at 04:36:48PM -0600, Bjorn Helgaas wrote:
I don't quite follow. By simply reverting, do you mean to revert f93e71aea6c6 ("Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"")? IIUC that would break Michael's machine again.
Right, at least until that issue is fully understood and alternative fixes have been considered.
If that's not an option, we need to rework core to pass a flag through more than one layer to indicate whether pcie_aspm_pm_state_change() should take the bus semaphore or not. I'd rather not do that if it can be avoided.
As a revert appears unlikely to happen, let's fix the regression by adding a new helper pci_set_power_state_locked() that can be called with the bus lock held:
https://lore.kernel.org/lkml/20240130100243.11011-1-johan+linaro@kernel.org/
Johan
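For readers unfamiliar with the "_locked helper" idea mentioned above, here is a minimal user-space sketch of the general pattern. It is not the code from the linked patch: the toy_* names, the struct layout, and the single-flag lock are all assumptions made for illustration. The point is only that a walker which already holds the lock calls the variant that expects the lock to be held, while every other caller uses the variant that takes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bus protected by a non-recursive lock. */
struct toy_bus {
	bool locked;       /* whether the bus lock is currently held */
	int power_state;   /* 0 models D0, larger values model lower-power states */
	int aspm_updates;  /* how many times the ASPM recheck ran */
};

static void toy_bus_lock(struct toy_bus *bus)
{
	/* A second acquisition here models the recursive locking in the report. */
	assert(!bus->locked);
	bus->locked = true;
}

static void toy_bus_unlock(struct toy_bus *bus)
{
	assert(bus->locked);
	bus->locked = false;
}

/* Variant for callers that already hold the bus lock. */
static void toy_set_power_state_locked(struct toy_bus *bus, int state)
{
	assert(bus->locked);
	bus->power_state = state;
	bus->aspm_updates++;   /* stands in for the ASPM recheck */
}

/* Variant for callers that do not hold the bus lock. */
static void toy_set_power_state(struct toy_bus *bus, int state)
{
	toy_bus_lock(bus);
	toy_set_power_state_locked(bus, state);
	toy_bus_unlock(bus);
}

/* Models a bus walk: the callback runs with the bus lock held. */
static void toy_walk_bus(struct toy_bus *bus,
			 void (*cb)(struct toy_bus *, int), int state)
{
	toy_bus_lock(bus);
	cb(bus, state);
	toy_bus_unlock(bus);
}
```

Passing toy_set_power_state_locked as the walk callback completes without tripping the assertion in toy_bus_lock(), whereas passing toy_set_power_state would trip it. Compared with threading a "lock held" flag through several layers, the split keeps the locking expectation visible in the function name at each call site.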