Wow!
Sasha I think an impersonator has gotten into your account, and is just making nonsense up.
At first glance this reads like an impassioned plea to backport this change, from someone who has actually dealt with it.
Unfortunately reading the justification in detail is an exercise in reading falsehoods.
If this does not come from an impersonator then: - If this comes from a human being, I recommend you have a talk with them. - If this comes from a machine I recommend you take it out of commission and rework it.
At best all of this appears to be an effort to get someone else to do necessary thinking for you. As my time for kernel work is very limited I expect I will auto-nack any such future attempts to outsource someone else's thinking on me.
Eric
Sasha Levin sashal@kernel.org writes:
From: Mario Limonciello mario.limonciello@amd.com
[ Upstream commit 12ffc3b1513ebc1f11ae77d053948504a94a68a6 ]
Currently swap is restricted before drivers have had a chance to do their prepare() PM callbacks. Restricting swap this early means that if a driver needs to evict some content from memory into sawp in it's prepare callback, it won't be able to.
On AMD dGPUs this can lead to failed suspends under memory pressure situations as all VRAM must be evicted to system memory or swap.
Move the swap restriction to right after all devices have had a chance to do the prepare() callback. If there is any problem with the sequence, restore swap in the appropriate dpm resume callbacks or error handling paths.
Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362 Signed-off-by: Mario Limonciello mario.limonciello@amd.com Tested-by: Nat Wittstock nat@fardog.io Tested-by: Lucian Langa lucilanga@7pot.org Link: https://patch.msgid.link/20250613214413.4127087-1-superm1@kernel.org Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Sasha Levin sashal@kernel.org
**YES**
This commit should be backported to stable kernel trees for the following reasons:
Really? And when those reasons turn out to be baloney?
## Critical Bug Fix for Real User Issues
- **Fixes Actual Suspend Failures**: The commit addresses real-world suspend failures under memory pressure on systems with AMD discrete GPUs. The linked issues (ROCm/ROCK-Kernel-Driver#174 and freedesktop.org/drm/amd#2362) indicate this affects actual users.
Those linked issues are completely corrupted in the paragraph above. From the original commit the proper issues are:
https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 https://gitlab.freedesktop.org/drm/amd/-/issues/2362
Which indicate that something is going on, but are old enough and long enough coming to any kind of conclusion from them is not easy.
- **Regression Fix**: This is effectively a regression fix. The PM subsystem's early swap restriction prevents AMD GPU drivers from properly evicting VRAM during their prepare() callbacks, which is a requirement that has become more critical as GPU VRAM sizes have increased.
There is no indication that this used to work, or that an earlier kernel change caused this to stop working. This is not a regression.
## Small, Contained Change
- **Minimal Code Changes**: The fix is remarkably simple - it just moves the `pm_restrict_gfp_mask()` call from early in the suspend sequence to after `dpm_prepare()` completes. The changes are:
- Move `pm_restrict_gfp_mask()` from multiple early locations to inside `dpm_suspend_start()` after `dpm_prepare()` succeeds
- Add corresponding `pm_restore_gfp_mask()` calls in error paths and resume paths
- Remove the now-redundant calls from hibernate.c and suspend.c
Reworking how different layers of the kernel interact is not minimal, and it not self contained.
- **Low Risk of Regression**: The change maintains the original intent of preventing I/O during the critical suspend phase while allowing it during device preparation. The swap restriction still happens before `dpm_suspend()`, just after `dpm_prepare()`.
There is no analysis anywhere on what happens to the code with code that might expect the old behavior.
So it is not possible to conclude a low risk of regression, in fact we can't conclude anything.
## Follows Stable Rules
- **Meets Stable Criteria**:
- Fixes a real bug that bothers people (suspend failures)
Addresses a real bug, yes. Fixes?
- Small change (moves function calls, doesn't introduce new logic)
No.
- Obviously correct (allows drivers to use swap during their designated preparation phase)
Not at all. It certainly isn't obvious to me what is going on.
- Already tested by users (Tested-by tags from affected users)
Yes there are Tested-by tags.
## Similar to Other Backported Commits
- **Pattern Matches**: Looking at the similar commits provided, this follows the same pattern as the AMD GPU eviction commits that were backported. Those commits also addressed the same fundamental issue - ensuring GPU VRAM can be properly evicted during suspend/hibernation.
Which other commits are those?
## Critical Timing
Timing?
- **Error Path Handling**: The commit properly handles error paths by adding `pm_restore_gfp_mask()` calls in:
- `dpm_resume_end()` for normal resume
- `platform_recover()` error path in suspend.c
- `pm_restore_gfp_mask()` in kexec_core.c for kexec flows
I don't see anything in this change that has to do with error paths.
The commit is well-tested, addresses a real problem affecting users, and makes a minimal, obviously correct change to fix suspend failures on systems with discrete GPUs under memory pressure.
The evidence that a 3 week old change is well tested, simply because it has been merged into Linus's change seems lacking.
Tested yes, but is it well tested? Are there any possible side effects?
I certainly see no evidence of any testing or any exercise at all of the kexec path modified. I wasn't even away of this change until this backport came in.
Eric