On 11/25/19 12:26 PM, Mina Almasry wrote:
On Fri, Nov 8, 2019 at 4:46 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
On 11/8/19 4:40 PM, Mina Almasry wrote:
On Fri, Nov 8, 2019 at 4:01 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
On 11/8/19 3:48 PM, Mina Almasry wrote:
On Thu, Nov 7, 2019 at 4:57 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:

+static void hugetlb_cgroup_move_parent_reservation(int idx,
+						    struct hugetlb_cgroup *h_cg)
+{
+	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
+
+	/* Move the reservation counters. */
+	if (!parent_hugetlb_cgroup(h_cg)) {
+		parent = root_h_cgroup;
+		/* root has no limit */
+		page_counter_charge(
+			&root_h_cgroup->reserved_hugepage[idx],
+			page_counter_read(
+				hugetlb_cgroup_get_counter(h_cg, idx, true)));
+	}
+
+	/* Take the pages off the local counter */
+	page_counter_cancel(
+		hugetlb_cgroup_get_counter(h_cg, idx, true),
+		page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true)));
+}
I know next to nothing about cgroups and am just comparing this to the existing hugetlb_cgroup_move_parent() routine. hugetlb_cgroup_move_parent updates the cgroup pointer in each page being moved. Do we need to do something similar for reservations being moved (move pointer in reservation)?
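(For reference, the tail of the existing per-page path both moves the charge and re-points the page at its new cgroup; roughly the following, paraphrased from memory from mm/hugetlb_cgroup.c rather than quoted verbatim:)

	if (!parent) {
		parent = root_h_cgroup;
		/* root has no limit */
		page_counter_charge(&parent->hugepage[idx], nr_pages);
	}
	/* Take the pages off the local counter */
	page_counter_cancel(&h_cg->hugepage[idx], nr_pages);

	/* Re-point the page at the parent cgroup */
	set_hugetlb_cgroup(page, parent);

It is that last pointer update that the reservation routine above seems to have no analogue of.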
Oh, good catch. Yes I need to be doing that. I should probably consolidate those routines so the code doesn't miss things like this.
This might get a bit ugly/complicated? Seems like you will need to examine all hugetlbfs inodes and the vmas mapping those inodes.
Hmm, yes, on closer look it does seem like this is not straightforward. I'll write a test that exercises this reparenting so I can start running into the issue and poke at solutions. Off the top of my head, I think maybe we can just not reparent the hugetlb reservations - the hugetlb_cgroup would stay alive until all its memory is uncharged. That shouldn't be too bad. I think memcg today doesn't reparent memory when a cgroup gets offlined.
I'll poke at this a bit and come back with suggestions; you may want to hold off reviewing the rest of the patches until then.
Ok, if we start considering what the correct cgroup reparenting semantics should be, it would be good to get input from others with more cgroup experience.
So I looked into this and prototyped a couple of solutions:
- We could reparent the hugetlb reservations using the same approach we use today to reparent hugetlb faults. Basically, faulted hugetlb pages today live on hstate->hugepage_activelist. When a hugetlb cgroup gets offlined, this list is traversed and any pages on it that point to the cgroup being offlined are reparented. hugetlb_lock is used to make sure cgroup offlining doesn't race with a page being freed. I can add another list, one that has pointers to the reservations made. When the cgroup is being offlined, it traverses this list and reparents any reservations (which will need to acquire the proper resv_map->lock to do the reparenting); see the rough sketch at the end of this mail. hugetlb_lock also needs to be acquired here to make sure that resv_map release doesn't race with another thread reparenting the memory in that resv_map.
Pros: Maintains the current reparenting behavior, and makes sure that reparenting of reservations works exactly the same way as reparenting of hugetlb faults. Cons: The code is a bit complex. There may be subtle object lifetime bugs, since I'm not 100% sure acquiring hugetlb_lock removes all races.
- We could just not reparent hugetlb reservations. I.e., on hugetlb cgroup offlining, the hugetlb faults get reparented (which maintains the current user-facing behavior), but hugetlb reservation charges remain charged to the hugetlb cgroup. The cgroup lives on as a zombie until all the reservations are uncharged.
Pros: Much easier implementation. Converges the behavior of memcg and hugetlb cgroup, since memcg also doesn't reparent memory charged to it. Cons: Behavior change, as hugetlb cgroups will become zombies if there are reservations charged to them. I've discussed this offlist with Shakeel, and AFAICT there is absolutely no user-facing behavior change from zombie cgroups, unless the user is specifically checking for zombies.
I'm torn between these 2 options right now, but I'm leaning towards #2. I think I will propose #2 in a patch for review, and if anyone is broken by that (again, my understanding is that this is very unlikely), then I'll propose a patch that reverts the changes in #2 and implements the changes in #1.
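For concreteness, here is a very rough sketch of what the option #1 traversal could look like. Everything new in it is a hypothetical placeholder rather than existing code: the per-cgroup reservations list, the hcg_link and reservation_counter members I'm imagining in resv_map, and the nr_reserved_pages() helper:

static void hugetlb_cgroup_move_parent_reservations(int idx,
						    struct hugetlb_cgroup *h_cg)
{
	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
	struct resv_map *resv_map;

	/* root has no limit, but keep its counter consistent */
	if (!parent)
		parent = root_h_cgroup;

	/* keep offlining from racing with resv_map release */
	spin_lock(&hugetlb_lock);
	/* hypothetical list of resv_maps whose reservations are charged to h_cg */
	list_for_each_entry(resv_map, &h_cg->reservations[idx], hcg_link) {
		unsigned long nr_pages;

		spin_lock(&resv_map->lock);
		nr_pages = nr_reserved_pages(resv_map);	/* hypothetical helper */

		page_counter_charge(&parent->reserved_hugepage[idx], nr_pages);
		page_counter_cancel(&h_cg->reserved_hugepage[idx], nr_pages);

		/* the reservation analogue of re-pointing a page at its parent */
		resv_map->reservation_counter = &parent->reserved_hugepage[idx];
		spin_unlock(&resv_map->lock);
	}
	/* hand the whole list to the parent so a later offline can walk it */
	list_splice_init(&h_cg->reservations[idx], &parent->reservations[idx]);
	spin_unlock(&hugetlb_lock);
}

The part I'm least sure about is the resv_map lifetime: this only works if resv_map teardown also takes hugetlb_lock (or otherwise synchronizes with offlining) before dropping its charge, which is where the subtle lifetime bugs mentioned above could hide.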
I of course like option #2 because it introduces fewer (if any) additional changes to the hugetlb reservation code for non-cgroup users. :)
Any feedback from Shakeel or other people with cgroup expertise (especially for hugetlb cgroup or memcg) is very useful here.
Yes, that would be very helpful.