On 11/25/19 12:26 PM, Mina Almasry wrote:
On Fri, Nov 8, 2019 at 4:46 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
On 11/8/19 4:40 PM, Mina Almasry wrote:
On Fri, Nov 8, 2019 at 4:01 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
On 11/8/19 3:48 PM, Mina Almasry wrote:
On Thu, Nov 7, 2019 at 4:57 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
On 10/29/19 6:36 PM, Mina Almasry wrote:

+static void hugetlb_cgroup_move_parent_reservation(int idx,
+						    struct hugetlb_cgroup *h_cg)
+{
+	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
+
+	/* Move the reservation counters. */
+	if (!parent_hugetlb_cgroup(h_cg)) {
+		parent = root_h_cgroup;
+		/* root has no limit */
+		page_counter_charge(
+			&root_h_cgroup->reserved_hugepage[idx],
+			page_counter_read(
+				hugetlb_cgroup_get_counter(h_cg, idx, true)));
+	}
+
+	/* Take the pages off the local counter */
+	page_counter_cancel(
+		hugetlb_cgroup_get_counter(h_cg, idx, true),
+		page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx, true)));
+}
I know next to nothing about cgroups and am just comparing this to the existing hugetlb_cgroup_move_parent() routine. hugetlb_cgroup_move_parent updates the cgroup pointer in each page being moved. Do we need to do something similar for reservations being moved (move pointer in reservation)?
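(For reference, the tail of the existing per-page path both moves the charge and re-points the page at its new cgroup; roughly the following, paraphrased from memory from mm/hugetlb_cgroup.c rather than quoted verbatim:)

	if (!parent) {
		parent = root_h_cgroup;
		/* root has no limit */
		page_counter_charge(&parent->hugepage[idx], nr_pages);
	}
	/* Take the pages off the local counter */
	page_counter_cancel(&h_cg->hugepage[idx], nr_pages);

	/* Re-point the page at the parent cgroup */
	set_hugetlb_cgroup(page, parent);

It is that last pointer update that the reservation routine above seems to have no analogue of.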
Oh, good catch. Yes I need to be doing that. I should probably consolidate those routines so the code doesn't miss things like this.
This might get a bit ugly/complicated? Seems like you will need to examine all hugetlbfs inodes and the vmas mapping those inodes.
Hmm, yes, on closer look it does seem like this is not straightforward. I'll write a test that exercises this reparenting so I can start running into the issue and poke at solutions. Off the top of my head, I think maybe we can just not reparent the hugetlb reservations - the hugetlb_cgroup would stay alive until all its memory is uncharged. That shouldn't be too bad. I think memcg today doesn't reparent memory when a cgroup gets offlined.
I'll poke at this a bit and come back with suggestions; you may want to hold off reviewing the rest of the patches until then.
Ok, if we start considering what the correct cgroup reparenting semantics should be, it would be good to get input from others with more cgroup experience.
So I looked into this and prototyped a couple of solutions:
- We could reparent the hugetlb reservations using the same approach we use today to reparent hugetlb faults. Basically, faulted hugetlb pages today live on hstate->hugepage_activelist. When a hugetlb cgroup gets offlined, this list is traversed and any pages on it that point to the cgroup being offlined are reparented. hugetlb_lock is used to make sure cgroup offlining doesn't race with a page being freed. I can add another list, one that has pointers to the reservations made. When the cgroup is being offlined, it traverses this list and reparents any reservations (which will need to acquire the proper resv_map->lock to do the reparenting); see the rough sketch at the end of this mail. hugetlb_lock also needs to be acquired here to make sure that resv_map release doesn't race with another thread reparenting the memory in that resv_map.
Pros: Maintains the current reparenting behavior, and makes sure that reparenting of reservations works exactly the same way as reparenting of hugetlb faults. Cons: The code is a bit complex. There may be subtle object lifetime bugs, since I'm not 100% sure acquiring hugetlb_lock removes all races.
- We could just not reparent hugetlb reservations. I.e., on hugetlb cgroup offlining, the hugetlb faults get reparented (which maintains the current user-facing behavior), but hugetlb reservation charges remain charged to the hugetlb cgroup. The cgroup lives on as a zombie until all the reservations are uncharged.
Pros: Much easier implementation. Converges the behavior of memcg and hugetlb cgroup, since memcg also doesn't reparent memory charged to it. Cons: Behavior change, as hugetlb cgroups will become zombies if there are reservations charged to them. I've discussed this offlist with Shakeel, and AFAICT there is absolutely no user-facing behavior change from zombie cgroups, unless the user is specifically checking for zombies.
I'm torn between these 2 options right now, but I'm leaning towards #2. I think I will propose #2 in a patch for review, and if anyone is broken by that (again, my understanding is that this is very unlikely), then I'll propose a patch that reverts the changes in #2 and implements the changes in #1.
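For concreteness, here is a very rough sketch of what the option #1 traversal could look like. Everything new in it is a hypothetical placeholder rather than existing code: the per-cgroup reservations list, the hcg_link and reservation_counter members I'm imagining in resv_map, and the nr_reserved_pages() helper:

static void hugetlb_cgroup_move_parent_reservations(int idx,
						    struct hugetlb_cgroup *h_cg)
{
	struct hugetlb_cgroup *parent = parent_hugetlb_cgroup(h_cg);
	struct resv_map *resv_map;

	/* root has no limit, but keep its counter consistent */
	if (!parent)
		parent = root_h_cgroup;

	/* keep offlining from racing with resv_map release */
	spin_lock(&hugetlb_lock);
	/* hypothetical list of resv_maps whose reservations are charged to h_cg */
	list_for_each_entry(resv_map, &h_cg->reservations[idx], hcg_link) {
		unsigned long nr_pages;

		spin_lock(&resv_map->lock);
		nr_pages = nr_reserved_pages(resv_map);	/* hypothetical helper */

		page_counter_charge(&parent->reserved_hugepage[idx], nr_pages);
		page_counter_cancel(&h_cg->reserved_hugepage[idx], nr_pages);

		/* the reservation analogue of re-pointing a page at its parent */
		resv_map->reservation_counter = &parent->reserved_hugepage[idx];
		spin_unlock(&resv_map->lock);
	}
	/* hand the whole list to the parent so a later offline can walk it */
	list_splice_init(&h_cg->reservations[idx], &parent->reservations[idx]);
	spin_unlock(&hugetlb_lock);
}

The part I'm least sure about is the resv_map lifetime: this only works if resv_map teardown also takes hugetlb_lock (or otherwise synchronizes with offlining) before dropping its charge, which is where the subtle lifetime bugs mentioned above could hide.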
I of course like option #2 because it introduces fewer (if any) additional changes to the hugetlb reservation code for non-cgroup users. :)
Any feedback from Shakeel or other people with cgroup expertise (especially for hugetlb cgroup or memcg) is very useful here.
Yes, that would be very helpful.