On Wed, Dec 11, 2024 at 11:32:13AM -0800, Shakeel Butt wrote:
On Wed, Dec 11, 2024 at 04:50:39PM +0000, Matthew Wilcox wrote:
Perhaps you'd be more persuaded by:
(a) If we clear __GFP_ACCOUNT then alloc_pages_bulk() will work, and that's a pretty significant performance win over calling alloc_pages() in a loop.
(b) Once we get to memdescs, calling alloc_pages() with __GFP_ACCOUNT set is going to require allocating a memdesc to store the obj_cgroup in, so in the future we'll save an allocation.
Your proposed alternative will work and is way less churn. But it's not preparing us for memdescs ;-)
We can make alloc_pages_bulk() work with __GFP_ACCOUNT but your second argument is more compelling.
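To put point (a) in concrete terms, here is a toy userspace model, not kernel code. Every name in it (model_alloc_page, model_alloc_bulk, model_fill, MODEL_GFP_ACCOUNT) is made up for illustration, and the assumed behaviour is only that a bulk request carrying the accounting flag degrades to one page per trip into the allocator, while an unaccounted request is satisfied in a single trip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_GFP_ACCOUNT 0x1u	/* hypothetical stand-in for __GFP_ACCOUNT */
#define MODEL_PAGE_SIZE   4096u

static unsigned int model_alloc_calls;	/* trips into the "allocator" */

/* One page per call, like calling alloc_pages() for order 0 in a loop. */
static void *model_alloc_page(unsigned int gfp)
{
	(void)gfp;
	model_alloc_calls++;
	return malloc(MODEL_PAGE_SIZE);
}

/*
 * Fill up to nr slots and return how many were populated.
 * Assumption for this sketch: with accounting requested, the bulk
 * path declines to batch and hands back a single page.
 */
static unsigned int model_alloc_bulk(unsigned int gfp, unsigned int nr,
				     void **pages)
{
	unsigned int i;

	if (nr == 0)
		return 0;

	if (gfp & MODEL_GFP_ACCOUNT) {
		pages[0] = model_alloc_page(gfp);
		return pages[0] ? 1 : 0;
	}

	model_alloc_calls++;	/* one trip covers the whole batch */
	for (i = 0; i < nr; i++) {
		pages[i] = malloc(MODEL_PAGE_SIZE);
		if (!pages[i])
			break;
	}
	return i;
}

/* Typical bulk caller: keep asking until the array is full. */
static unsigned int model_fill(unsigned int gfp, unsigned int nr,
			       void **pages)
{
	unsigned int got = 0;

	while (got < nr) {
		unsigned int n = model_alloc_bulk(gfp, nr - got, pages + got);

		if (!n)
			break;
		got += n;
	}
	return got;
}
```

Under those assumptions, filling eight slots with the accounting flag set costs eight trips into the allocator, and filling them with the flag cleared costs one. The real saving depends on what the page allocator does per trip, which this model does not try to capture.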
I am trying to think of what we will miss if we remove this per-page memcg metadata. One thing I can think of is debugging a live system or kdump where I need to track where a given page came from. I think
Umm, I don't think you know which vmalloc allocation a page came from today? I've sent patches to add that information before, but they were rejected. In fact, I don't think we know even _that_ a page belongs to vmalloc today, do we? Yes, we know that the page is accounted, and which memcg it belongs to ... but nothing more.
I actually want to improve this, without adding additional overhead. What I'm working on right now (before I got waylaid by this bug) is:
+struct choir {
+	struct kref refcount;
+	unsigned int nr;
+	struct page *pages[] __counted_by(nr);
+};
and rewriting vmalloc to be based on choirs instead of its own pages. One thing I've come to realise today is that the obj_cgroup pointer needs to be in the choir and not in the vm_struct so that we uncharge the allocation when the choir refcount drops to 0, not when the allocation is unmapped.
A regular choir allocation will (today) mark the pages in it as being allocated to a choir (and thus not having their own refcount / mapcount), but I'll give vmalloc a way to mark the pages as specifically being from vmalloc.
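The lifetime point above (uncharge when the last choir reference goes away, not at unmap time) can be sketched in userspace. This is only a model under stated assumptions: struct choir_model, struct fake_objcg and the choir_alloc/choir_get/choir_put helpers are hypothetical names, a plain int stands in for struct kref, and a page counter stands in for the real obj_cgroup charge.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an obj_cgroup: a count of charged pages. */
struct fake_objcg {
	long charged_pages;
};

/* Userspace model of the choir; a plain int replaces struct kref. */
struct choir_model {
	int refcount;
	unsigned int nr;
	struct fake_objcg *objcg;	/* kept in the choir, not the mapping */
	void *pages[];			/* stand-in for struct page pointers */
};

static struct choir_model *choir_alloc(struct fake_objcg *cg, unsigned int nr)
{
	struct choir_model *c;

	c = calloc(1, sizeof(*c) + nr * sizeof(c->pages[0]));
	if (!c)
		return NULL;

	c->refcount = 1;
	c->nr = nr;
	c->objcg = cg;
	cg->charged_pages += nr;	/* charge once, at allocation */
	return c;
}

static void choir_get(struct choir_model *c)
{
	assert(c->refcount > 0);
	c->refcount++;
}

/* Returns 1 if this put released the choir, 0 otherwise. */
static int choir_put(struct choir_model *c)
{
	assert(c->refcount > 0);
	if (--c->refcount)
		return 0;

	/* Last reference gone: only now is the memory uncharged. */
	c->objcg->charged_pages -= c->nr;
	free(c);
	return 1;
}
```

In this model, if something else still holds a reference when the mapping drops its own, the charge stays in place until that last holder calls choir_put(). That is the reason for keeping the cgroup pointer with the choir rather than with the vm_struct.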
There's a lot of moving parts to this ... it's proving quite tricky!
I think we can go with Johannes' solution for stable and discuss the future direction more separately.
OK, I'll send a patch to do that.