Hello
I've checked with 5.18 the problem is still there. Interestingly, I tried to revert the commit but it was rejected because of the change in the test from: if (!adev->in_s0ix) to: if (amdgpu_acpi_should_gpu_reset(adev))
in amdgpu_pmops_suspend.
I fixed the rejection, keeping shoud_gpu_reset, but it still fails. Then I changed to restore test of in_s0ix as it was in 5.17, and it works. I tried with a call to amd_gpu_asic_reset without testing at all in_s0ix, it works.
Therefore, my APU wants a reset in amdgpu_pmops_suspend.
By curiosity, I tried to do the reset in amdgpu_pmops_suspend_noirq as was intended in 5.18 original code, commenting out the test of amdgpu_acpi_should_gpu_reset(adev) (since this APU wants a reset). This does not work, I got the Fence timeout errors or freezes.
If I leave noirq function unchanged (original 5.18 code), and just add a reset in suspend() as was done in 5.17, it works.
Therefore, my GPU does NOT want to be reset in noirq, the reset must be in suspend.
In other words, I modified amdgpu_pmops_suspend (partial revert) like this and this works on my laptop:
static int amdgpu_pmops_suspend(struct device *dev) { struct drm_device *drm_dev = dev_get_drvdata(dev); struct amdgpu_device *adev = drm_to_adev(drm_dev); + int r;
if (amdgpu_acpi_is_s0ix_active(adev)) adev->in_s0ix = true; else adev->in_s3 = true; - return amdgpu_device_suspend(drm_dev, true); + r = amdgpu_device_suspend(drm_dev, true); + if (r) + return r; + if (!adev->in_s0ix) + return amdgpu_asic_reset(adev); return 0; }
static int amdgpu_pmops_suspend_noirq(struct device *dev) { struct drm_device *drm_dev = dev_get_drvdata(dev); struct amdgpu_device *adev = drm_to_adev(drm_dev);
if (amdgpu_acpi_should_gpu_reset(adev)) return amdgpu_asic_reset(adev);
return 0; }
I don't know if other APU want a reset, in the same context, and how to differentiate all the cases, so I cannot go further, but I can test patches if needed.
CC
Le mercredi 18 mai 2022, 08:37:27 CEST Thorsten Leemhuis a écrit :
On 18.05.22 07:54, Kai-Heng Feng wrote:
On Wed, May 18, 2022 at 1:52 PM Thorsten Leemhuis
regressions@leemhuis.info wrote:
On 17.05.22 19:37, casteyde.christian@free.fr wrote:
I've tryied to revert the offending commit on 5.18-rc7 (887f75cfd0da ("drm/amdgpu: Ensure HDA function is suspended before ASIC reset"), and the problem disappears so it's really this commit that breaks.
In that case I'll update the regzbot status to make sure it's visible as regression introduced in the 5.18 cycle:
#regzbot introduced: 887f75cfd0da
BTW: obviously would be nice to get this fixed before 5.18 is released (which might already happen on Sunday), especially as the culprit apparently was already backported to stable, but I guess that won't be easy...
Which made me wondering: is reverting the culprit temporarily in mainline (and reapplying it later with a fix) a option here?
It's too soon to call it's the culprit.
Well, sure, the root-cause might be somewhere else. But from the point of kernel regressions (and tracking them) it's the culprit, as that's the change that triggers the misbehavior. And that's how Linus approaches these things as well when it comes to reverting to fix regressions -- and he even might...
The suspend on the system doesn't work properly at the first place.
...ignore things like this, as long as a revert is unlikely to cause more damage than good.
Ciao. Thorsten
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I deal with a lot of reports and sometimes miss something important when writing mails like this. If that's the case here, don't hesitate to tell me in a public reply, it's in everyone's interest to set the public record straight.