On Thu, Nov 23, 2023 at 07:20:46PM +0100, Oleksandr Natalenko wrote:
Hello.
Since the v6.6.2 kernel release I've been experiencing a regression in USB port behaviour after a suspend/resume cycle.
If a USB port is empty before suspending, after resuming the machine the port doesn't work. After a device insertion there's no reaction in the kernel log whatsoever, although I do see that the device gets powered up physically. If the machine is suspended with a device inserted into the USB port, the port works fine after resume.
This is an AMD-based machine with hci version 0x110 reported. As per the changelog between v6.6.1 and v6.6.2, 603 commits were backported into v6.6.2, and one of the commits was as follows:
$ git log --oneline v6.6.1..v6.6.2 -- drivers/usb/host/xhci-pci.c
14a51fa544225 xhci: Loosen RPM as default policy to cover for AMD xHC 1.1
It seems that this commit explicitly enables runtime PM specifically for my platform. As per dmesg:
v6.6.1: quirks 0x0000000000000410
v6.6.2: quirks 0x0000000200000410
Here, bit 33 gets set, which, as expected, corresponds to:
drivers/usb/host/xhci.h
1895:#define XHCI_DEFAULT_PM_RUNTIME_ALLOW BIT_ULL(33)
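The bit arithmetic can be double-checked with a minimal userspace C sketch. It is only an illustration: BIT_ULL() is redefined locally to mirror the kernel macro, and the QUIRKS_* names and the lowest_set_bit() helper are hypothetical and do not appear in the driver. The two masks are the values from dmesg above.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's BIT_ULL() macro (assumption: userspace build). */
#define BIT_ULL(n) (1ULL << (n))
#define XHCI_DEFAULT_PM_RUNTIME_ALLOW BIT_ULL(33)

/* Quirk masks as printed in dmesg; the macro names are illustrative only. */
#define QUIRKS_V6_6_1 0x0000000000000410ULL
#define QUIRKS_V6_6_2 0x0000000200000410ULL

/* Compile-time check (C11): the only bit that differs is bit 33. */
static_assert((QUIRKS_V6_6_1 ^ QUIRKS_V6_6_2) == XHCI_DEFAULT_PM_RUNTIME_ALLOW,
              "v6.6.1 and v6.6.2 quirks differ by more than bit 33");

/* Hypothetical helper: index of the lowest set bit in mask, or -1 if mask is 0. */
int lowest_set_bit(uint64_t mask)
{
    for (int bit = 0; bit < 64; bit++)
        if (mask & BIT_ULL(bit))
            return bit;
    return -1;
}
```

Calling lowest_set_bit(QUIRKS_V6_6_1 ^ QUIRKS_V6_6_2) from a small main() returns 33, which matches the XHCI_DEFAULT_PM_RUNTIME_ALLOW definition.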
This commit is backported from the upstream commit 4baf12181509, which is one of 16 commits of the following series named "xhci features":
https://lore.kernel.org/all/20231019102924.2797346-1-mathias.nyman@linux.int...
It appears that there was another commit in this series, also from Basavaraj (in Cc), a5d6264b638e, which was not picked for v6.6.2, but which stated the following:
Use the low-power states of the underlying platform to enable runtime PM. If the platform doesn't support runtime D3, then enabling default RPM will result in the controller malfunctioning, as in the case of hotplug devices not being detected because of a failed interrupt generation.
It felt like this was exactly my case. So, I've conducted two tests:
- Reverted 14a51fa544225 from v6.6.2. With this revert the USB ports started to work fine, just as they did in v6.6.1.
- Left 14a51fa544225 in place, but also applied upstream a5d6264b638e on top of v6.6.2. With this patch added the USB ports also work after a suspend/resume cycle.
This runtime PM enablement did also impact my AX200 Bluetooth device, resulting in long delays before headphones/speaker can connect, but I've solved this with btusb.enable_autosuspend=N. I think this has nothing to do with the original issue, and I'm OK with this workaround unless someone has got a different idea.
With that, please consider either reverting 14a51fa544225 from the stable kernel, or applying a5d6264b638e in addition to it. Given that the mainline kernel has both of them, I'm in favour of applying the additional commit to the stable kernel.
I've applied this other commit as well to all of the affected branches, thanks for letting us know.
I'm also Cc'ing all the people from our Mastodon discussion where I initially complained about the issue as well as about stable kernel branch stability:
https://activitypub.natalenko.name/@oleksandr/statuses/01HFRXBYWMXF9G4KYPE3X...
I'm not going to expand more on that in this email, especially given Greg indicated he read the conversation, but I'm open to continuing this discussion, as I still think that the current workflow brings visible issues to ordinary users, and hence some adjustments should be made.
What type of adjustments exactly? Testing on wide ranges of systems is pretty hard, and this patch was explicitly set to be backported when it hit Linus's tree. It just looks like someone forgot to mark the follow-up patch that you found to be backported as well.
We will always make mistakes, we are only human. The best thing is for us to get notified quickly of issues, like you did here, and to work to resolve them, as we have done here. So again, thanks for letting us know about the problem, and be sure to let us know of any future issues you might find as well.
Remember, hardware is messy, and the kernel's job is to fix hardware issues and quirks in it. Sometimes we get it wrong as we are trying to fix up inconsistencies, and the fixes cause other problems. In the end, we can only grumble at the hardware companies for stuff like this, so be patient with those of us who have to deal with this mess :)
thanks,
greg k-h