On Fri, Jul 18, 2025 at 6:01 PM Leo Li sunpeng.li@amd.com wrote:
On 2025-07-18 17:33, Alex Deucher wrote:
On Fri, Jul 18, 2025 at 5:02 PM Leo Li sunpeng.li@amd.com wrote:
On 2025-07-18 16:07, Alex Deucher wrote:
On Fri, Jul 18, 2025 at 1:57 PM Brian Geffon bgeffon@google.com wrote:
On Thu, Jul 17, 2025 at 10:59 AM Alex Deucher alexdeucher@gmail.com wrote:
On Wed, Jul 16, 2025 at 8:13 PM Brian Geffon bgeffon@google.com wrote:
>
> On Wed, Jul 16, 2025 at 5:03 PM Alex Deucher alexdeucher@gmail.com wrote:
>>
>> On Wed, Jul 16, 2025 at 12:40 PM Brian Geffon bgeffon@google.com wrote:
>>>
>>> On Wed, Jul 16, 2025 at 12:33 PM Alex Deucher alexdeucher@gmail.com wrote:
>>>>
>>>> On Wed, Jul 16, 2025 at 12:18 PM Brian Geffon bgeffon@google.com wrote:
>>>>>
>>>>> Commit 81d0bcf99009 ("drm/amdgpu: make display pinning more flexible (v2)")
>>>>> allowed newer ASICs to mix GTT and VRAM; the change also noted that
>>>>> some older boards, such as Stoney and Carrizo, do not support this.
>>>>> It appears that at least one additional ASIC does not support this,
>>>>> which is Raven.
>>>>>
>>>>> We observed this issue when migrating a device from a 5.4 to 6.6 kernel
>>>>> and have confirmed that Raven also needs to be excluded from mixing GTT
>>>>> and VRAM.
>>>>
>>>> Can you elaborate a bit on what the problem is? For carrizo and
>>>> stoney this is a hardware limitation (all display buffers need to be
>>>> in GTT or VRAM, but not both). Raven and newer don't have this
>>>> limitation and we tested raven pretty extensively at the time.
>>>
>>> Thanks for taking the time to look. We have automated testing and a
>>> few igt gpu tools tests failed; after debugging we found that
>>> commit 81d0bcf99009 is what introduced the failures on this hardware
>>> on 6.1+ kernels. The specific tests that fail are kms_async_flips and
>>> kms_plane_alpha_blend; excluding Raven from this sharing of GTT and
>>> VRAM buffers resolves the issue.
>>
>> + Harry and Leo
>>
>> This sounds like the memory placement issue we discussed last week.
>> In that case, the issue is related to where the buffer ends up when we
>> try to do an async flip. We can't do an async flip without a full
>> modeset if the buffer's location differs from the last modeset,
>> because we need to update more than just the buffer base addresses.
>> This change works around that limitation by always forcing
>> display buffers into VRAM or GTT. Adding raven to this case may fix
>> those tests but will make the overall experience worse, because we'll
>> effectively lose the ability to fully utilize both gtt and vram for
>> display, which would reintroduce all of the problems fixed by
>> 81d0bcf99009 ("drm/amdgpu: make display pinning more flexible (v2)").
>
> Thanks Alex, the thing is, we only observe this on Raven boards, why
> would Raven only be impacted by this? It would seem that all devices
> would have this issue, no? Also, I'm not familiar with how
It depends on memory pressure and available memory in each pool. E.g., initially the display buffer is in VRAM when the initial mode set happens. The watermarks, etc. are set for that scenario. One of the next frames ends up in a pool different than the original. Now the buffer is in GTT. The async flip interface does a fast validation to try and flip as soon as possible, but that validation fails because the watermarks need to be updated which requires a full modeset.
Huh, I'm not sure if this actually is an issue for APUs. The fix that introduced a check for same memory placement on async flips was on a system with a DGPU, for which VRAM placement does matter: https://github.com/torvalds/linux/commit/a7c0cad0dc060bb77e9c9d235d68441b0fc...
Looking around in DM/DML, for APUs, I don't see any logic that changes DCN bandwidth validation depending on memory placement. There's a gpuvm_enable flag for SG, but it's statically set to 1 on APU DCN versions. It sounds like for APUs specifically, we *should* be able to ignore the mem placement check. I can spin up a patch to test this out.
Is the gpu_vm_support flag ever set for dGPUs? The allowed domains for display buffers are determined by amdgpu_display_supported_domains(), and we only allow GTT as a domain if gpu_vm_support is set, which I think is just for APUs. In that case, we probably only need the checks for CHIP_CARRIZO and CHIP_STONEY since, IIRC, they don't support mixed VRAM and GTT (only one or the other?). dGPUs and really old APUs will always get VRAM, and newer APUs will get VRAM | GTT.
It doesn't look like gpu_vm_support is set for DGPUs https://elixir.bootlin.com/linux/v6.15.6/source/drivers/gpu/drm/amd/display/...
Though interestingly, further up at #L1858, Raven has gpu_vm_support = 0. Maybe it had stability issues? https://github.com/torvalds/linux/commit/098c13079c6fdd44f10586b69132c392ebf...
We need to be a little careful here; asic_type == CHIP_RAVEN covers several variants:
  apu_flags & AMD_APU_IS_RAVEN   - raven1 (gpu_vm_support = false)
  apu_flags & AMD_APU_IS_RAVEN2  - raven2 (gpu_vm_support = true)
  apu_flags & AMD_APU_IS_PICASSO - picasso (gpu_vm_support = true)
amdgpu_display_supported_domains() only sets AMDGPU_GEM_DOMAIN_GTT if gpu_vm_support is true, so we'd never get into the check in amdgpu_bo_get_preferred_domain() for raven1.
Anyway, back to your suggestion, I think we can probably drop the checks as you should always get a compatible memory buffer due to amdgpu_bo_get_preferred_domain(). Pinning should fail if we can't pin in the required domain. amdgpu_display_supported_domains() will ensure you always get VRAM or GTT or VRAM | GTT depending on what the chip supports. Then amdgpu_bo_get_preferred_domain() will either leave that as is, or force VRAM or GTT for the STONEY/CARRIZO case. On the off chance we do get incompatible memory, something like the attached patch should do the trick.
Alex
- Leo
Alex
Thanks, Leo
It's tricky to fix because you don't want to use the worst case watermarks all the time, since that would limit the number of available display options, and you don't want to force everything into a particular memory pool, since that would limit the amount of memory that can be used for display (which is what the patch in question fixed). Ideally the caller would do a test commit before the page flip to determine whether or not it would succeed, and then we'd have some feedback mechanism to tell the caller that the commit would fail due to buffer placement so it would do a full modeset instead. We discussed this feedback mechanism last week at the display hackfest.
> kms_plane_alpha_blend works, but does this also support that test
> failing as the cause?
That may be related. I'm not too familiar with that test either, but Leo or Harry can provide some guidance.
Alex
Thanks everyone for the input so far. I have a question for the maintainers: given that this seems to be functionally broken for iGPU ASICs, and there does not seem to be an easy fix, does it make sense to extend this proposed patch to all iGPUs until a more permanent fix can be identified? At the end of the day I'll take functional correctness over performance.
It's not functional correctness, it's usability. All that is potentially broken is async flips (which depend on memory pressure and buffer placement), while if you effectively revert the patch, you end up limiting all display buffers to either VRAM or GTT, which may make it impossible to display anything at all because there is not enough memory in that pool for the next modeset. We'll start getting bug reports about blank screens and failure to set modes because of memory pressure. I think if we want a short term fix, it would be to always set the worst case watermarks. The downside is that it would possibly cause some working display setups to stop working if they were on the margins to begin with.
Alex
Brian
>
> Thanks again,
> Brian
>
>>
>> Alex
>>
>>>
>>> Brian
>>>
>>>>
>>>> Alex
>>>>
>>>>>
>>>>> Fixes: 81d0bcf99009 ("drm/amdgpu: make display pinning more flexible (v2)")
>>>>> Cc: Luben Tuikov luben.tuikov@amd.com
>>>>> Cc: Christian König christian.koenig@amd.com
>>>>> Cc: Alex Deucher alexander.deucher@amd.com
>>>>> Cc: stable@vger.kernel.org # 6.1+
>>>>> Tested-by: Thadeu Lima de Souza Cascardo cascardo@igalia.com
>>>>> Signed-off-by: Brian Geffon bgeffon@google.com
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> index 73403744331a..5d7f13e25b7c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> @@ -1545,7 +1545,8 @@ uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
>>>>>                                          uint32_t domain)
>>>>>  {
>>>>>         if ((domain == (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT)) &&
>>>>> -           ((adev->asic_type == CHIP_CARRIZO) || (adev->asic_type == CHIP_STONEY))) {
>>>>> +           ((adev->asic_type == CHIP_CARRIZO) || (adev->asic_type == CHIP_STONEY) ||
>>>>> +            (adev->asic_type == CHIP_RAVEN))) {
>>>>>                 domain = AMDGPU_GEM_DOMAIN_VRAM;
>>>>>                 if (adev->gmc.real_vram_size <= AMDGPU_SG_THRESHOLD)
>>>>>                         domain = AMDGPU_GEM_DOMAIN_GTT;
>>>>> --
>>>>> 2.50.0.727.gbf7dc18ff4-goog