On 2/24/26 10:43, Maxime Ripard wrote:
Hi Christian,
On Fri, Feb 20, 2026 at 10:45:08AM +0100, Christian König wrote:
On 2/20/26 02:14, T.J. Mercier wrote:
On Wed, Feb 18, 2026 at 9:15 AM Eric Chanudet <echanude@redhat.com> wrote:
Hi Eric,
An earlier series[1] from Maxime introduced dmem to the CMA allocator in an attempt to use it generally for dma-buf. Restart from there and apply the charge in the narrower context of the CMA dma-buf heap instead.
In line with introducing cgroup accounting to the system heap[2], this behavior is enabled via dma_heap.mem_accounting and disabled by default.
dmem is chosen for CMA heaps as it allows limits to be set for each region backing each heap. The charge is only applied in the dma-buf heap for now, as that guarantees it can be accounted against the userspace process that requested the allocation.
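To make the per-region idea concrete, here is a minimal sketch of that accounting model: each region backing a heap gets its own limit, and allocations are charged against it on behalf of the requesting process. The class and region names are hypothetical illustrations, not the kernel's actual dmem interface.

```python
class DmemRegionLimit:
    """Toy model of a per-region dmem limit (names are illustrative)."""

    def __init__(self, name, limit_bytes):
        self.name = name
        self.limit = limit_bytes  # cap for this region
        self.current = 0          # bytes currently charged

    def try_charge(self, size):
        """Charge an allocation, failing if it would exceed the limit."""
        if self.current + size > self.limit:
            return False
        self.current += size
        return True

    def uncharge(self, size):
        self.current = max(0, self.current - size)

# A 64 MiB limit on a hypothetical CMA region backing one heap.
region = DmemRegionLimit("cma/reserved", 64 << 20)
print(region.try_charge(48 << 20))   # True: fits under the limit
print(region.try_charge(32 << 20))   # False: would exceed 64 MiB
```

Because each heap's backing region carries its own counter, an administrator can cap one CMA area without touching limits on any other pool.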
But CMA memory is system memory, and regular (non-CMA) movable allocations can occur out of these CMA areas. So this splits system memory accounting between memcg (from [2]) and dmem. If I want to put a limit on system memory use I have to adjust multiple limits (memcg + dmems) and know how to divide the total between them all.
How do you envision using this combination of different controllers?
Yeah we have this problem pretty much everywhere.
There are use cases where you want to account device allocations to memcg and use cases where you don't.
From what I know at the moment, it would be best if the administrator could say, for each dmem, whether it should additionally account to memcg or not.
Using module parameters to enable/disable it globally is just a workaround as far as I can see.
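A sketch of the scheme Christian suggests: an administrator-set flag on each dmem region deciding whether charges are mirrored into memcg. All names here are hypothetical; the point is only the charge/rollback ordering such a scheme would need.

```python
class Counter:
    """Minimal limit counter standing in for a dmem region or a memcg."""

    def __init__(self, limit):
        self.limit = limit
        self.current = 0

    def try_charge(self, size):
        if self.current + size > self.limit:
            return False
        self.current += size
        return True

    def uncharge(self, size):
        self.current -= size

def charge_allocation(dmem, memcg, size, account_to_memcg):
    """Charge dmem first, then optionally memcg; roll back on failure."""
    if not dmem.try_charge(size):
        return False
    if account_to_memcg and not memcg.try_charge(size):
        dmem.uncharge(size)  # undo the dmem charge so totals stay consistent
        return False
    return True

dmem = Counter(256 << 20)    # 256 MiB dmem region limit
memcg = Counter(128 << 20)   # 128 MiB memcg limit

# With mirroring enabled, the tighter memcg limit rejects the allocation.
print(charge_allocation(dmem, memcg, 192 << 20, account_to_memcg=True))   # False
print(charge_allocation(dmem, memcg, 192 << 20, account_to_memcg=False))  # True
```

The flag lives on the region rather than on a module, so the same kernel can run some regions memcg-mirrored and others fully separate.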
That's a pretty good idea! It would indeed be a solution that could satisfy everyone (I assume?).
I think so yeah.
From what I have seen we have three different use cases:
1. Local device memory (VRAM), GTT/CMA, and memcg are completely separate domains, and you want completely separate limit values for each.
2. Local device memory (VRAM) is separate; GTT/CMA are accounted to memcg, and you can still set separate limit values so that nobody over-allocates CMA (for example).
3. All three are accounted to memcg, because system memory is actually used as fallback when applications over-allocate device local memory.
It's debatable what the default should be, but we clearly need to handle all three use cases, potentially even on the same system.
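The three configurations above can be written out as which controller(s) each memory pool charges. Pool and controller names are illustrative shorthand, not kernel identifiers.

```python
# Use case -> pool -> set of controllers that get charged.
ACCOUNTING = {
    1: {"vram": {"dmem"}, "gtt_cma": {"dmem"},          "system": {"memcg"}},
    2: {"vram": {"dmem"}, "gtt_cma": {"dmem", "memcg"}, "system": {"memcg"}},
    3: {"vram": {"dmem", "memcg"},
        "gtt_cma": {"dmem", "memcg"},
        "system": {"memcg"}},
}

def controllers_charged(use_case, pool):
    """Which controllers an allocation from `pool` is charged against."""
    return ACCOUNTING[use_case][pool]

# In use case 2, a GTT/CMA allocation is charged to both controllers.
print(sorted(controllers_charged(2, "gtt_cma")))   # ['dmem', 'memcg']
```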
Regards, Christian.
On Tue, 24 Feb 2026 at 20:32, Christian König <christian.koenig@amd.com> wrote:
> From what I have seen we have three different use cases:
> 1. Local device memory (VRAM), GTT/CMA, and memcg are completely separate domains, and you want completely separate limit values for each.
> 2. Local device memory (VRAM) is separate; GTT/CMA are accounted to memcg, and you can still set separate limit values so that nobody over-allocates CMA (for example).
> 3. All three are accounted to memcg, because system memory is actually used as fallback when applications over-allocate device local memory.
> It's debatable what the default should be, but we clearly need to handle all three use cases, potentially even on the same system.
Give me cases where 1 or 3 actually make sense in the real world.
I can maybe accept 1 if CMA is just old-school CMA carved out pre-boot, so it's not in the main memory pool, but in that case it's really just equivalent to device memory.
If something is in the main memory pool, it should be accounted for using memcg. You cannot remove memory from the main memory pool without accounting for it. Now we can add GPU limits to memcg; that was going to be a next step in my series.
Whether we have that as a percentage or a hard limit, we would just say the GPU can consume 95% of the configured max for this cgroup.
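The percentage variant is just a derived cap: the GPU limit is computed from the cgroup's configured memory maximum instead of being set separately. A minimal sketch, with illustrative numbers and a hypothetical helper name:

```python
def gpu_limit(memcg_max_bytes, fraction=0.95):
    """Derive a GPU allocation cap as a fraction of the cgroup's memory.max."""
    return int(memcg_max_bytes * fraction)

# E.g. a cgroup with an 8 GiB memory.max would allow GPU use up to 95% of it.
print(gpu_limit(8 << 30))
```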
3 to me just sounds like we haven't figured out fallback or suspend/resume accounting yet, which is true, but I'm not sure there is a reason for 3 to exist outside of the fact that we don't know how to account for temporary storage of swapped-out VRAM objects.
It might be that we need a limited transfer pool of system memory for VRAM objects to "live in", but we move them to swap as soon as possible once we hit the limit on that pool. What we do on systems where no swap is available gets into "I've no idea" territory.
Statically partitioning memory between dmem and memcg isn't going to solve this; we should solve it inside memcg.
Dave.
On 2/26/26 00:43, Dave Airlie wrote:
> Give me cases where 1 or 3 actually make sense in the real world.
> I can maybe accept 1 if CMA is just old-school CMA carved out pre-boot, so it's not in the main memory pool, but in that case it's really just equivalent to device memory.
Well, I think #1 is pretty much the default for dGPUs on a desktop. That's why I mentioned it first.
> If something is in the main memory pool, it should be accounted for using memcg. You cannot remove memory from the main memory pool without accounting for it.
That's what I'm strongly disagreeing with. See, the page cache is not accounted to memcg either: when you open a file and the kernel caches the backing pages, that doesn't reduce the amount you can allocate through malloc, does it?
For dGPUs, GTT is basically just the fallback when you over-allocate local memory (plus a few things for uploads).
In other words, system memory becomes the swap of device local memory. Just think about why memcg doesn't limit swap, but only how much is swapped out.
For those use cases you want a hard static limit on how much system memory can be used as swap. That's why we originally had the per-driver gttsize parameter, the global TTM page limit, etc.
The problem is that we weakened those limitations because of the APU use case, and that in turn resulted in all those problems with browsers over-allocating system memory, etc.
Now cgroups should provide an alternative, and I still think this is the right approach, but within it I think we want to preserve the original idea of separate domains for dGPUs.
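The "hard static limit on system memory used as swap" can be sketched as a bounded pool that evicted VRAM objects land in, in the spirit of the old gttsize/TTM page limits mentioned above. Names and sizes here are hypothetical.

```python
class SwapPool:
    """System-memory pool that temporarily holds buffers evicted from VRAM."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes   # static cap, independent of memcg
        self.used = 0

    def evict_into(self, size):
        """Accept an evicted buffer only while under the static cap."""
        if self.used + size > self.limit:
            return False   # caller must push older buffers to real swap
        self.used += size
        return True

pool = SwapPool(1 << 30)             # 1 GiB of system memory as "device swap"
print(pool.evict_into(768 << 20))    # True
print(pool.evict_into(512 << 20))    # False: cap reached, spill to disk swap
```

The cap bounds how much of the main memory pool device eviction can ever consume, regardless of the per-process memcg limits.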
> Now we can add GPU limits to memcg; that was going to be a next step in my series.
> Whether we have that as a percentage or a hard limit, we would just say the GPU can consume 95% of the configured max for this cgroup.
That is only useful on APUs, which don't have local memory, because those make all of their allocations from system memory.
dGPUs should be much more limited in that regard.
> 3 to me just sounds like we haven't figured out fallback or suspend/resume accounting yet, which is true, but I'm not sure there is a reason for 3 to exist outside of the fact that we don't know how to account for temporary storage of swapped-out VRAM objects.
Mario has fixed, or is at least working on, the suspend/resume problems, so I don't consider that an issue any more.
Use case 3 happens on HPC systems where device local memory is basically just a cache, for example this one: https://en.wikipedia.org/wiki/Frontier_(supercomputer)
In this use case you don't care whether a buffer is in device local memory or system memory; what you care about is that things are reliable, and for that the task at hand shouldn't exceed a certain limit.
E.g. you run computation A, which can use 100GB of resources, and when computation B starts concurrently you don't want A to suddenly fail because it now has to fight with B for resources.
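The A/B example boils down to admission control: if the per-job hard limits sum to no more than the machine's capacity, starting B cannot make A's allocations fail. A toy sketch with illustrative numbers:

```python
CAPACITY_GB = 256   # total memory (device + system) the node can commit

def can_admit(running_jobs, new_limit_gb):
    """Admit a new job only if the sum of hard limits stays within capacity."""
    return sum(running_jobs.values()) + new_limit_gb <= CAPACITY_GB

jobs = {"A": 100}                  # A is guaranteed up to 100 GB
print(can_admit(jobs, 100))        # True: B's 100 GB also fits
print(can_admit(jobs, 200))        # False: would oversubscribe, A could fail
```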
> It might be that we need a limited transfer pool of system memory for VRAM objects to "live in", but we move them to swap as soon as possible once we hit the limit on that pool. What we do on systems where no swap is available gets into "I've no idea" territory.
> Statically partitioning memory between dmem and memcg isn't going to solve this; we should solve it inside memcg.
Well, it's certainly possible to solve all of this in memcg, but I don't think it's very elegant.
Static partitioning between memcg and dmem for the dGPU case, merged accounting for the APU case by default, and then giving the system administrator the option to switch to use case 3 sounds much more flexible to me.
At least the obvious advantage is that you don't start adding module parameters to TTM, DMA-buf heaps, and drivers controlling whether they should account to memcg; you keep all the logic inside cgroups.
Christian.