From: Brian Norris <briannorris@google.com>
When transitioning to D3cold, __pci_set_power_state() will first transition a device to D3hot. If the device was already in D3hot, this will add excess work: (a) read/modify/write PMCSR; and (b) excess delay (pci_dev_d3_sleep()).
For (b), we already performed the necessary delay on the previous D3hot entry; this was extra noticeable when evaluating runtime PM transition latency.
Check whether we're already in the target state before continuing.
Note that __pci_set_power_state() already does this same check for other state transitions, but D3cold is special because __pci_set_power_state() converts it to D3hot for the purposes of PMCSR.
This seems to be an oversight in commit 0aacdc957401 ("PCI/PM: Clean up pci_set_low_power_state()").
Fixes: 0aacdc957401 ("PCI/PM: Clean up pci_set_low_power_state()")
Cc: stable@vger.kernel.org
Signed-off-by: Brian Norris <briannorris@google.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
---
 drivers/pci/pci.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b0f4d98036cd..7517f1380201 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1539,6 +1539,9 @@ static int pci_set_low_power_state(struct pci_dev *dev, pci_power_t state, bool
 	   || (state == PCI_D2 && !dev->d2_support))
 		return -EIO;
 
+	if (state == dev->current_state)
+		return 0;
+
 	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
 	if (PCI_POSSIBLE_ERROR(pmcsr)) {
 		pci_err(dev, "Unable to change power state from %s to %s, device inaccessible\n",
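As a rough illustration of what the added check buys, here is a minimal user-space sketch in C. It is only a model under stated assumptions: `struct model_dev`, `model_set_low_power_state()` and the two counters are hypothetical names invented for this example, not kernel code, and the counters merely stand in for the PMCSR read/modify/write and the pci_dev_d3_sleep() delay described in the commit message.

```c
/* Hypothetical user-space model; none of these names exist in the kernel. */
enum model_state { MODEL_D0, MODEL_D1, MODEL_D2, MODEL_D3HOT };

struct model_dev {
	enum model_state current_state;
	int pmcsr_writes;	/* simulated PMCSR read/modify/write cycles */
	int d3_delays;		/* simulated pci_dev_d3_sleep() calls */
};

/*
 * Models the patched pci_set_low_power_state(): if the device is already
 * in the requested state, return before touching PMCSR or sleeping.
 */
static int model_set_low_power_state(struct model_dev *dev,
				     enum model_state state)
{
	if (state == dev->current_state)
		return 0;

	dev->pmcsr_writes++;		/* stands in for the PMCSR update */
	if (state == MODEL_D3HOT)
		dev->d3_delays++;	/* stands in for the D3hot delay */

	dev->current_state = state;
	return 0;
}
```

In this model, requesting D3hot twice in a row from D0 leaves both counters at 1, because the second request returns early.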
Hi,
On Fri, Oct 03, 2025 at 03:40:09PM -0700, Brian Norris wrote:
> From: Brian Norris <briannorris@google.com>
>
> When transitioning to D3cold, __pci_set_power_state() will first transition a device to D3hot. If the device was already in D3hot, this will add excess work: (a) read/modify/write PMCSR; and (b) excess delay (pci_dev_d3_sleep()).
How come the device is already in D3hot when __pci_set_power_state() is called? IIRC the PCI core will transition the device to a low power state, passing the deepest possible state, and at that point the device is still in D0. Then __pci_set_power_state() puts it into D3hot and then turns off the power resource -> D3cold.
What am I missing here?
> For (b), we already performed the necessary delay on the previous D3hot entry; this was extra noticeable when evaluating runtime PM transition latency.
>
> Check whether we're already in the target state before continuing.
>
> Note that __pci_set_power_state() already does this same check for other state transitions, but D3cold is special because __pci_set_power_state() converts it to D3hot for the purposes of PMCSR.
>
> This seems to be an oversight in commit 0aacdc957401 ("PCI/PM: Clean up pci_set_low_power_state()").
>
> Fixes: 0aacdc957401 ("PCI/PM: Clean up pci_set_low_power_state()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Brian Norris <briannorris@google.com>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
BTW, I think only one SoB from you is enough ;-)
Hi Mika,
On Mon, Oct 06, 2025 at 03:52:22PM +0200, Mika Westerberg wrote:
> On Fri, Oct 03, 2025 at 03:40:09PM -0700, Brian Norris wrote:
> > From: Brian Norris <briannorris@google.com>
> >
> > When transitioning to D3cold, __pci_set_power_state() will first transition a device to D3hot. If the device was already in D3hot, this will add excess work: (a) read/modify/write PMCSR; and (b) excess delay (pci_dev_d3_sleep()).
>
> How come the device is already in D3hot when __pci_set_power_state() is called? IIRC the PCI core will transition the device to a low power state, passing the deepest possible state, and at that point the device is still in D0. Then __pci_set_power_state() puts it into D3hot and then turns off the power resource -> D3cold.
>
> What am I missing here?
Some PCI drivers call pci_set_power_state(..., PCI_D3hot) on their own when preparing for runtime or system suspend, so by the time they hit pci_finish_runtime_suspend(), they're in D3hot. Then, pci_target_state() may still pick a lower state (D3cold).
HTH,
Brian
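The sequence described above (a driver enters D3hot itself, then the core targets D3cold) can be sketched as a small user-space model in C. This is a simplified illustration, not the kernel implementation: `model_set_power_state()`, `model_suspend_delays()` and the `skip_if_unchanged` flag are hypothetical names made up for the example, and the platform power-off step is reduced to a state assignment.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
enum model_state { MODEL_D0, MODEL_D3HOT, MODEL_D3COLD };

struct model_dev {
	enum model_state current_state;
	int d3_delays;		/* simulated pci_dev_d3_sleep() calls */
	bool skip_if_unchanged;	/* true models the behavior with the patch */
};

/* Models pci_set_low_power_state(), with the new check made optional. */
static void model_set_low_power_state(struct model_dev *dev,
				      enum model_state state)
{
	if (dev->skip_if_unchanged && state == dev->current_state)
		return;

	dev->d3_delays++;	/* PMCSR update plus D3hot delay */
	dev->current_state = state;
}

/*
 * Models __pci_set_power_state(): D3cold is reached by first programming
 * D3hot through PMCSR and then letting the platform remove power.
 */
static void model_set_power_state(struct model_dev *dev,
				  enum model_state state)
{
	if (state == dev->current_state)
		return;

	if (state == MODEL_D3COLD) {
		model_set_low_power_state(dev, MODEL_D3HOT);
		dev->current_state = MODEL_D3COLD;	/* power removed */
	} else {
		model_set_low_power_state(dev, state);
	}
}

/* Counts the D3hot delays for "driver picks D3hot, core picks D3cold". */
static int model_suspend_delays(bool patched)
{
	struct model_dev dev = {
		.current_state = MODEL_D0,
		.skip_if_unchanged = patched,
	};

	model_set_power_state(&dev, MODEL_D3HOT);	/* driver's own call */
	model_set_power_state(&dev, MODEL_D3COLD);	/* core's target state */
	return dev.d3_delays;
}
```

In this model the outer check does not help, because D3cold differs from the current D3hot state; only the check inside the low-power helper avoids the second delay. `model_suspend_delays(false)` counts two delays and `model_suspend_delays(true)` counts one.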
On Mon, Oct 06, 2025 at 11:32:38AM -0700, Brian Norris wrote:
> On Mon, Oct 06, 2025 at 03:52:22PM +0200, Mika Westerberg wrote:
> > On Fri, Oct 03, 2025 at 03:40:09PM -0700, Brian Norris wrote:
> > > From: Brian Norris <briannorris@google.com>
> > >
> > > When transitioning to D3cold, __pci_set_power_state() will first transition a device to D3hot. If the device was already in D3hot, this will add excess work: (a) read/modify/write PMCSR; and (b) excess delay (pci_dev_d3_sleep()).
> >
> > How come the device is already in D3hot when __pci_set_power_state() is called? IIRC the PCI core will transition the device to a low power state, passing the deepest possible state, and at that point the device is still in D0. Then __pci_set_power_state() puts it into D3hot and then turns off the power resource -> D3cold.
> >
> > What am I missing here?
>
> Some PCI drivers call pci_set_power_state(..., PCI_D3hot) on their own when preparing for runtime or system suspend, so by the time they hit pci_finish_runtime_suspend(), they're in D3hot. Then, pci_target_state() may still pick a lower state (D3cold).
We might need this change, but maybe this is also an opportunity to remove some of those pci_set_power_state(..., PCI_D3hot) calls from drivers.
I didn't look into any of them in detail, but I would jump at any chance to remove PCI details from driver suspend paths. There are only ~20 calls from suspend functions, ~25 from shutdown, and a few from poweroff. The fact that there are so few makes me think they might be leftovers that could be more fully converted to generic PM.
On Mon, Oct 06, 2025 at 02:33:33PM -0500, Bjorn Helgaas wrote:
> On Mon, Oct 06, 2025 at 11:32:38AM -0700, Brian Norris wrote:
> > On Mon, Oct 06, 2025 at 03:52:22PM +0200, Mika Westerberg wrote:
> > > On Fri, Oct 03, 2025 at 03:40:09PM -0700, Brian Norris wrote:
> > > > From: Brian Norris <briannorris@google.com>
> > > >
> > > > When transitioning to D3cold, __pci_set_power_state() will first transition a device to D3hot. If the device was already in D3hot, this will add excess work: (a) read/modify/write PMCSR; and (b) excess delay (pci_dev_d3_sleep()).
> > >
> > > How come the device is already in D3hot when __pci_set_power_state() is called? IIRC the PCI core will transition the device to a low power state, passing the deepest possible state, and at that point the device is still in D0. Then __pci_set_power_state() puts it into D3hot and then turns off the power resource -> D3cold.
> > >
> > > What am I missing here?
> >
> > Some PCI drivers call pci_set_power_state(..., PCI_D3hot) on their own when preparing for runtime or system suspend, so by the time they hit pci_finish_runtime_suspend(), they're in D3hot. Then, pci_target_state() may still pick a lower state (D3cold).
>
> We might need this change, but maybe this is also an opportunity to remove some of those pci_set_power_state(..., PCI_D3hot) calls from drivers.
Agreed. PCI client drivers should have no business opting for D3hot in the suspend path. It should be the other way around: they should opt out, if they want, by calling pci_save_state(), but that is also subject to discussion.
- Mani
Hi,
On Mon, Oct 06, 2025 at 11:32:38AM -0700, Brian Norris wrote:
> Hi Mika,
>
> On Mon, Oct 06, 2025 at 03:52:22PM +0200, Mika Westerberg wrote:
> > On Fri, Oct 03, 2025 at 03:40:09PM -0700, Brian Norris wrote:
> > > From: Brian Norris <briannorris@google.com>
> > >
> > > When transitioning to D3cold, __pci_set_power_state() will first transition a device to D3hot. If the device was already in D3hot, this will add excess work: (a) read/modify/write PMCSR; and (b) excess delay (pci_dev_d3_sleep()).
> >
> > How come the device is already in D3hot when __pci_set_power_state() is called? IIRC the PCI core will transition the device to a low power state, passing the deepest possible state, and at that point the device is still in D0. Then __pci_set_power_state() puts it into D3hot and then turns off the power resource -> D3cold.
> >
> > What am I missing here?
>
> Some PCI drivers call pci_set_power_state(..., PCI_D3hot) on their own when preparing for runtime or system suspend, so by the time they hit pci_finish_runtime_suspend(), they're in D3hot. Then, pci_target_state() may still pick a lower state (D3cold).
Ah, right. Thanks for the clarification.
Yeah, I agree with Bjorn and Mani that those calls should go away (the PCI core does that already). That makes driver writers' lives simpler wrt. PCI PM.