On Wed, Dec 11, 2024 at 08:20:36PM +0000, Matthew Wilcox wrote:
On Wed, Dec 11, 2024 at 11:32:13AM -0800, Shakeel Butt wrote:
On Wed, Dec 11, 2024 at 04:50:39PM +0000, Matthew Wilcox wrote:
Perhaps you'd be more persuaded by:
(a) If we clear __GFP_ACCOUNT then alloc_pages_bulk() will work, and that's a pretty significant performance win over calling alloc_pages() in a loop.
(b) Once we get to memdescs, calling alloc_pages() with __GFP_ACCOUNT set is going to require allocating a memdesc to store the obj_cgroup in, so in the future we'll save an allocation.
Your proposed alternative will work and is way less churn. But it's not preparing us for memdescs ;-)
We can make alloc_pages_bulk() work with __GFP_ACCOUNT but your second argument is more compelling.
I am trying to think of what will we miss if we remove this per-page memcg metadata. One thing I can think of is debugging a live system or kdump where I need to track where a given page came from. I think
Umm, I don't think you know which vmalloc allocation a page came from today? I've sent patches to add that information before, but they were rejected.
Do you have a link handy for that discussion?
In fact, I don't think we know even _that_ a page belongs to vmalloc today, do we? Yes, we know that the page is accounted, and which memcg it belongs to ... but nothing more.
Yes you are correct. At the moment it is a guesswork and exhaustive search into multiple sources.
I actually want to improve this, without adding additional overhead. What I'm working on right now (before I got waylaid by this bug) is:
+struct choir {
struct kref refcount;
unsigned int nr;
struct page *pages[] __counted_by(nr);
+};
and rewriting vmalloc to be based on choirs instead of its own pages. One thing I've come to realise today is that the obj_cgroup pointer needs to be in the choir and not in the vm_struct so that we uncharge the allocation when the choir refcount drops to 0, not when the allocation is unmapped.
What/who else can take a reference on a choir?
A regular choir allocation will (today) mark the pages in it as being allocated to a choir (and thus not having their own refcount / mapcount), but I'll give vmalloc a way to mark the pages as specifically being from vmalloc.
This sounds good. Thanks for the awesome work.