Hello, Waiman.
On Wed, Apr 12, 2023 at 03:52:36PM -0400, Waiman Long wrote:
There is still a distribution hierarchy, as the list of isolated CPUs has to be distributed down to the target cgroup through the hierarchy. For example,
cgroup root
  +- isolcpus      (cpus 8,9; isolcpus)
  +- user.slice    (cpus 1-9; ecpus 1-7; member)
       +- user-x.slice (cpus 8,9; ecpus 8,9; isolated)
       +- user-y.slice (cpus 1,2; ecpus 1,2; member)
OTOH, I do agree that this can be somewhat hacky. That is why I posted it as an RFC to solicit feedback.
Wouldn't it be possible to make it hierarchical by adding another cpumask to cpuset which lists the cpus which are allowed in the hierarchy but not used unless claimed by an isolated domain?
Thanks.
On 4/12/23 16:22, Tejun Heo wrote:
Hello, Waiman.
On Wed, Apr 12, 2023 at 03:52:36PM -0400, Waiman Long wrote:
There is still a distribution hierarchy, as the list of isolated CPUs has to be distributed down to the target cgroup through the hierarchy. For example,
cgroup root
  +- isolcpus      (cpus 8,9; isolcpus)
  +- user.slice    (cpus 1-9; ecpus 1-7; member)
       +- user-x.slice (cpus 8,9; ecpus 8,9; isolated)
       +- user-y.slice (cpus 1,2; ecpus 1,2; member)
OTOH, I do agree that this can be somewhat hacky. That is why I posted it as an RFC to solicit feedback.
Wouldn't it be possible to make it hierarchical by adding another cpumask to cpuset which lists the cpus which are allowed in the hierarchy but not used unless claimed by an isolated domain?
I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs file. So there will be one in the root cgroup that defines all the isolated CPUs one can have. It is then distributed down the hierarchy and can be claimed only if a cgroup becomes an "isolated" partition. There will be a slight change in the semantics of an "isolated" partition, but I doubt there will be many users out there.
If you are OK with this approach, I can modify my patch series to do that.
Cheers, Longman
Hello,
On Wed, Apr 12, 2023 at 04:33:29PM -0400, Waiman Long wrote:
I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs file. So there will be one in the root cgroup that defines all the isolated CPUs one can have. It is then distributed down the hierarchy and can be claimed only if a cgroup becomes an "isolated" partition. There will be a slight
Yeah, that seems a lot more congruent with the typical pattern.
change in the semantics of an "isolated" partition, but I doubt there will be many users out there.
I haven't thought through it too hard but what prevents staying compatible with the current behavior?
If you are OK with this approach, I can modify my patch series to do that.
Thanks.
On 4/12/23 20:03, Tejun Heo wrote:
Hello,
On Wed, Apr 12, 2023 at 04:33:29PM -0400, Waiman Long wrote:
I think we can. You mean having a new "cpuset.cpus.isolated" cgroupfs file. So there will be one in the root cgroup that defines all the isolated CPUs one can have. It is then distributed down the hierarchy and can be claimed only if a cgroup becomes an "isolated" partition. There will be a slight
Yeah, that seems a lot more congruent with the typical pattern.
change in the semantics of an "isolated" partition, but I doubt there will be many users out there.
I haven't thought through it too hard but what prevents staying compatible with the current behavior?
It is possible to stay compatible with existing behavior. It is just that a break from existing behavior will make the solution cleaner.
So the new behavior will be:
If the "cpuset.cpus.isolated" isn't set, the existing rules apply. If it is set, the new rule will be used.
Does that look reasonable to you?
Cheers, Longman
Hello,
On Wed, Apr 12, 2023 at 08:26:03PM -0400, Waiman Long wrote:
If the "cpuset.cpus.isolated" isn't set, the existing rules apply. If it is set, the new rule will be used.
Does that look reasonable to you?
Sounds a bit contrived. Does it need to be something defined in the root cgroup? The only thing that's needed is that a cgroup needs to claim CPUs exclusively without using them, right? Let's say we add a new interface file, say, cpuset.cpus.reserve which is always exclusive and can be consumed by children whichever way they want, wouldn't that be sufficient? Then, there would be nothing to describe in the root cgroup.
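As a rough sketch of what that could look like (the knob semantics are only as proposed above; cgroup names and CPU numbers are made up for illustration):

    # the parent claims cpus 8-9 exclusively without using them itself
    echo 8-9 > parent/cpuset.cpus.reserve
    # a child consumes them, e.g. as an isolated partition
    echo 8-9 > parent/child/cpuset.cpus
    echo isolated > parent/child/cpuset.cpus.partition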
Thanks.
On 4/12/23 20:33, Tejun Heo wrote:
Hello,
On Wed, Apr 12, 2023 at 08:26:03PM -0400, Waiman Long wrote:
If the "cpuset.cpus.isolated" isn't set, the existing rules apply. If it is set, the new rule will be used.
Does that look reasonable to you?
Sounds a bit contrived. Does it need to be something defined in the root cgroup?
Yes, because we need to take away the isolated CPUs from the effective cpus of the root cgroup. So it needs to start from the root. That is also why we have the partition rule that the parent of a partition has to be a partition root itself. With the new scheme, we don't need a special cgroup to hold the isolated CPUs. The new root cgroup file will be enough to inform the system what CPUs will have to be isolated.
My current thinking is that the root's "cpuset.cpus.isolated" will start with whatever has been set in the "isolcpus" or "nohz_full" boot command line and can be extended from there but not shrunk below that, as there can be additional isolation attributes with those isolated CPUs.
Cheers, Longman
The only thing that's needed is that a cgroup needs to claim CPUs exclusively without using them, right? Let's say we add a new interface file, say, cpuset.cpus.reserve which is always exclusive and can be consumed by children whichever way they want, wouldn't that be sufficient? Then, there would be nothing to describe in the root cgroup.
Thanks.
Hello, Waiman.
On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote:
Sounds a bit contrived. Does it need to be something defined in the root cgroup?
Yes, because we need to take away the isolated CPUs from the effective cpus of the root cgroup. So it needs to start from the root. That is also why we have the partition rule that the parent of a partition has to be a partition root itself. With the new scheme, we don't need a special cgroup to hold the
I'm following. The root is already a partition root and the cgroupfs control knobs are owned by the parent, so the root cgroup would own the first level cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to assign some CPUs exclusively to a first level cgroup, it can then set that cgroup's reserve knob accordingly (or maybe the better name is cpuset.cpus.exclusive), which will take those CPUs out of the root cgroup's partition and give them to the first level cgroup. The first level cgroup then is free to do whatever with those CPUs that now belong exclusively to the cgroup subtree.
isolated CPUs. The new root cgroup file will be enough to inform the system what CPUs will have to be isolated.
My current thinking is that the root's "cpuset.cpus.isolated" will start with whatever has been set in the "isolcpus" or "nohz_full" boot command line and can be extended from there but not shrunk below that, as there can be additional isolation attributes with those isolated CPUs.
I'm not sure we wanna tie with those automatically. I think it'd be more confusing than helpful.
Thanks.
On 4/12/23 21:17, Tejun Heo wrote:
Hello, Waiman.
On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote:
Sounds a bit contrived. Does it need to be something defined in the root cgroup?
Yes, because we need to take away the isolated CPUs from the effective cpus of the root cgroup. So it needs to start from the root. That is also why we have the partition rule that the parent of a partition has to be a partition root itself. With the new scheme, we don't need a special cgroup to hold the
I'm following. The root is already a partition root and the cgroupfs control knobs are owned by the parent, so the root cgroup would own the first level cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to assign some CPUs exclusively to a first level cgroup, it can then set that cgroup's reserve knob accordingly (or maybe the better name is cpuset.cpus.exclusive), which will take those CPUs out of the root cgroup's partition and give them to the first level cgroup. The first level cgroup then is free to do whatever with those CPUs that now belong exclusively to the cgroup subtree.
I am OK with the cpuset.cpus.reserve name, but not that much with the cpuset.cpus.exclusive name as it can get confused with cgroup v1's cpuset.cpu_exclusive. Of course, I prefer the cpuset.cpus.isolated name a bit more. Once an isolated CPU gets used in an isolated partition, it is exclusive and it can't be used in another isolated partition.
Since we will allow users to set cpuset.cpus.reserve to whatever value they want, the distribution of isolated CPUs is only valid if the cpus are present in its parent's cpuset.cpus.reserve and all the way up to the root. It is a bit expensive, but it should be a relatively rare operation.
isolated CPUs. The new root cgroup file will be enough to inform the system what CPUs will have to be isolated.
My current thinking is that the root's "cpuset.cpus.isolated" will start with whatever has been set in the "isolcpus" or "nohz_full" boot command line and can be extended from there but not shrunk below that, as there can be additional isolation attributes with those isolated CPUs.
I'm not sure we wanna tie with those automatically. I think it'd be more confusing than helpful.
Yes, I am fine with taking this off for now.
Cheers, Longman
On 4/12/23 21:55, Waiman Long wrote:
On 4/12/23 21:17, Tejun Heo wrote:
Hello, Waiman.
On Wed, Apr 12, 2023 at 08:55:55PM -0400, Waiman Long wrote:
Sounds a bit contrived. Does it need to be something defined in the root cgroup?
Yes, because we need to take away the isolated CPUs from the effective cpus of the root cgroup. So it needs to start from the root. That is also why we have the partition rule that the parent of a partition has to be a partition root itself. With the new scheme, we don't need a special cgroup to hold the
I'm following. The root is already a partition root and the cgroupfs control knobs are owned by the parent, so the root cgroup would own the first level cgroups' cpuset.cpus.reserve knobs. If the root cgroup wants to assign some CPUs exclusively to a first level cgroup, it can then set that cgroup's reserve knob accordingly (or maybe the better name is cpuset.cpus.exclusive), which will take those CPUs out of the root cgroup's partition and give them to the first level cgroup. The first level cgroup then is free to do whatever with those CPUs that now belong exclusively to the cgroup subtree.
I am OK with the cpuset.cpus.reserve name, but not that much with the cpuset.cpus.exclusive name as it can get confused with cgroup v1's cpuset.cpu_exclusive. Of course, I prefer the cpuset.cpus.isolated name a bit more. Once an isolated CPU gets used in an isolated partition, it is exclusive and it can't be used in another isolated partition.
Since we will allow users to set cpuset.cpus.reserve to whatever value they want, the distribution of isolated CPUs is only valid if the cpus are present in its parent's cpuset.cpus.reserve and all the way up to the root. It is a bit expensive, but it should be a relatively rare operation.
I now have a slightly different idea of how to do that. We already have an internal cpumask for partitioning - subparts_cpus. I am thinking about exposing it as cpuset.cpus.reserve. The current way of creating subpartitions will be called automatic reservation and require a direct parent/child partition relationship. But as soon as a user writes anything to it, it will break automatic reservation and require manual reservation going forward.
In that way, we can keep the old behavior, but also support new use cases. I am going to work on that.
Cheers, Longman
On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
I now have a slightly different idea of how to do that. We already have an internal cpumask for partitioning - subparts_cpus. I am thinking about exposing it as cpuset.cpus.reserve. The current way of creating subpartitions will be called automatic reservation and require a direct parent/child partition relationship. But as soon as a user writes anything to it, it will break automatic reservation and require manual reservation going forward.
In that way, we can keep the old behavior, but also support new use cases. I am going to work on that.
I'm not sure I fully understand the proposed behavior but it does sound more quirky.
Thanks.
On 4/14/23 12:54, Tejun Heo wrote:
On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
I now have a slightly different idea of how to do that. We already have an internal cpumask for partitioning - subparts_cpus. I am thinking about exposing it as cpuset.cpus.reserve. The current way of creating subpartitions will be called automatic reservation and require a direct parent/child partition relationship. But as soon as a user writes anything to it, it will break automatic reservation and require manual reservation going forward.
In that way, we can keep the old behavior, but also support new use cases. I am going to work on that.
I'm not sure I fully understand the proposed behavior but it does sound more quirky.
The idea is to use the existing subparts_cpus for cpu reservation instead of adding a new cpumask for that purpose. The current way of partition creation does cpus reservation (setting subparts_cpus) automatically with the constraint that the parent of a partition must be a partition root itself. One way to relax this constraint is to allow a new manual reservation mode where users can set reserve cpus manually and distribute them down the hierarchy before activating a partition to use those cpus.
Now the question is how to enable this new manual reservation mode. One way to do it is to enable it whenever the new cpuset.cpus.reserve file is modified. Alternatively, we may enable it by a cgroupfs mount option or a boot command line option.
Hope this can clarify your confusion.
Cheers, Longman
On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
On 4/14/23 12:54, Tejun Heo wrote:
On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
I now have a slightly different idea of how to do that. We already have an internal cpumask for partitioning - subparts_cpus. I am thinking about exposing it as cpuset.cpus.reserve. The current way of creating subpartitions will be called automatic reservation and require a direct parent/child partition relationship. But as soon as a user writes anything to it, it will break automatic reservation and require manual reservation going forward.
In that way, we can keep the old behavior, but also support new use cases. I am going to work on that.
I'm not sure I fully understand the proposed behavior but it does sound more quirky.
The idea is to use the existing subparts_cpus for cpu reservation instead of adding a new cpumask for that purpose. The current way of partition creation does cpus reservation (setting subparts_cpus) automatically with the constraint that the parent of a partition must be a partition root itself. One way to relax this constraint is to allow a new manual reservation mode where users can set reserve cpus manually and distribute them down the hierarchy before activating a partition to use those cpus.
Now the question is how to enable this new manual reservation mode. One way to do it is to enable it whenever the new cpuset.cpus.reserve file is modified. Alternatively, we may enable it by a cgroupfs mount option or a boot command line option.
It'd probably be best if we can keep the behavior within cgroupfs if possible. Would you mind writing up the documentation section describing the behavior beforehand? I think things would be clearer if we look at it from the interface documentation side.
Thanks.
On 4/14/23 13:34, Tejun Heo wrote:
On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
On 4/14/23 12:54, Tejun Heo wrote:
On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
I now have a slightly different idea of how to do that. We already have an internal cpumask for partitioning - subparts_cpus. I am thinking about exposing it as cpuset.cpus.reserve. The current way of creating subpartitions will be called automatic reservation and require a direct parent/child partition relationship. But as soon as a user writes anything to it, it will break automatic reservation and require manual reservation going forward.
In that way, we can keep the old behavior, but also support new use cases. I am going to work on that.
I'm not sure I fully understand the proposed behavior but it does sound more quirky.
The idea is to use the existing subparts_cpus for cpu reservation instead of adding a new cpumask for that purpose. The current way of partition creation does cpus reservation (setting subparts_cpus) automatically with the constraint that the parent of a partition must be a partition root itself. One way to relax this constraint is to allow a new manual reservation mode where users can set reserve cpus manually and distribute them down the hierarchy before activating a partition to use those cpus.
Now the question is how to enable this new manual reservation mode. One way to do it is to enable it whenever the new cpuset.cpus.reserve file is modified. Alternatively, we may enable it by a cgroupfs mount option or a boot command line option.
It'd probably be best if we can keep the behavior within cgroupfs if possible. Would you mind writing up the documentation section describing the behavior beforehand? I think things would be clearer if we look at it from the interface documentation side.
Sure, will do that. I need some time and so it will be early next week.
Cheers, Longman
On 4/14/23 13:38, Waiman Long wrote:
On 4/14/23 13:34, Tejun Heo wrote:
On Fri, Apr 14, 2023 at 01:29:25PM -0400, Waiman Long wrote:
On 4/14/23 12:54, Tejun Heo wrote:
On Thu, Apr 13, 2023 at 09:22:19PM -0400, Waiman Long wrote:
I now have a slightly different idea of how to do that. We already have an internal cpumask for partitioning - subparts_cpus. I am thinking about exposing it as cpuset.cpus.reserve. The current way of creating subpartitions will be called automatic reservation and require a direct parent/child partition relationship. But as soon as a user writes anything to it, it will break automatic reservation and require manual reservation going forward.
In that way, we can keep the old behavior, but also support new use cases. I am going to work on that.
I'm not sure I fully understand the proposed behavior but it does sound more quirky.
The idea is to use the existing subparts_cpus for cpu reservation instead of adding a new cpumask for that purpose. The current way of partition creation does cpus reservation (setting subparts_cpus) automatically with the constraint that the parent of a partition must be a partition root itself. One way to relax this constraint is to allow a new manual reservation mode where users can set reserve cpus manually and distribute them down the hierarchy before activating a partition to use those cpus.
Now the question is how to enable this new manual reservation mode. One way to do it is to enable it whenever the new cpuset.cpus.reserve file is modified. Alternatively, we may enable it by a cgroupfs mount option or a boot command line option.
It'd probably be best if we can keep the behavior within cgroupfs if possible. Would you mind writing up the documentation section describing the behavior beforehand? I think things would be clearer if we look at it from the interface documentation side.
Sure, will do that. I need some time and so it will be early next week.
Just kidding :-)
Below is a draft of the new cpuset.cpus.reserve cgroupfs file:
cpuset.cpus.reserve
        A read-write multiple values file which exists on all cpuset-enabled cgroups.
It lists the reserved CPUs to be used for the creation of child partitions. See the section on "cpuset.cpus.partition" below for more information on cpuset partition. These reserved CPUs should be a subset of "cpuset.cpus" and will be mutually exclusive of "cpuset.cpus.effective" when used since these reserved CPUs cannot be used by tasks in the current cgroup.
There are two modes for partition CPU reservation - auto or manual. The system starts up in auto mode where "cpuset.cpus.reserve" will be set automatically when valid child partitions are created and users don't need to touch the file at all. This mode has the limitation that the parent of a partition must be a partition root itself. So child partitions have to be created one by one from the cgroup root down.
To enable the creation of a partition down in the hierarchy without requiring the intermediate cgroups to be partition roots, one has to turn on the manual reservation mode by writing directly to "cpuset.cpus.reserve" with a value different from its current value. By distributing the reserve CPUs down the cgroup hierarchy to the parent of the target cgroup, this target cgroup can be switched to become a partition root if its "cpuset.cpus" is a subset of the set of valid reserve CPUs in its parent. The set of valid reserve CPUs is the set that are present in all its ancestors' "cpuset.cpus.reserve" up to the cgroup root and which have not been allocated to another valid partition yet.
Once manual reservation mode is enabled, a cgroup administrator must always set up "cpuset.cpus.reserve" files properly before a valid partition can be created. So this mode has more administrative overhead but offers greater flexibility.
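A hypothetical walk-through of the manual mode described above (cgroup names, mount point and CPU numbers are made up for illustration):

    # switch to manual reservation and distribute cpus 8-9 down to the
    # parent of the target cgroup
    echo 8-9 > /sys/fs/cgroup/cpuset.cpus.reserve
    echo 8-9 > /sys/fs/cgroup/user.slice/cpuset.cpus.reserve
    # the target can now become a partition root since its cpuset.cpus is a
    # subset of the valid reserve CPUs of its parent
    echo 8-9 > /sys/fs/cgroup/user.slice/rt.slice/cpuset.cpus
    echo isolated > /sys/fs/cgroup/user.slice/rt.slice/cpuset.cpus.partition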
Cheers, Longman
Hello.
The previous thread arrived incomplete to me, so I respond to the last message only. Point me to a message URL if it was covered.
On Fri, Apr 14, 2023 at 03:06:27PM -0400, Waiman Long longman@redhat.com wrote:
Below is a draft of the new cpuset.cpus.reserve cgroupfs file:
cpuset.cpus.reserve
        A read-write multiple values file which exists on all cpuset-enabled cgroups.
It lists the reserved CPUs to be used for the creation of child partitions. See the section on "cpuset.cpus.partition" below for more information on cpuset partition. These reserved CPUs should be a subset of "cpuset.cpus" and will be mutually exclusive of "cpuset.cpus.effective" when used since these reserved CPUs cannot be used by tasks in the current cgroup.
There are two modes for partition CPU reservation - auto or manual. The system starts up in auto mode where "cpuset.cpus.reserve" will be set automatically when valid child partitions are created and users don't need to touch the file at all. This mode has the limitation that the parent of a partition must be a partition root itself. So child partitions have to be created one by one from the cgroup root down.
To enable the creation of a partition down in the hierarchy without requiring the intermediate cgroups to be partition roots,
Why would this be needed? Owning a CPU (a resource) must logically be passed all the way from the root to the target cgroup, i.e. this is expressed by valid partitioning down to the given level.
one
has to turn on the manual reservation mode by writing directly to "cpuset.cpus.reserve" with a value different from its current value. By distributing the reserve CPUs down the cgroup hierarchy to the parent of the target cgroup, this target cgroup can be switched to become a partition root if its "cpuset.cpus" is a subset of the set of valid reserve CPUs in its parent.
level n
 `- level n+1
      cpuset.cpus            // these are actually configured by "owner" of level n
      cpuset.cpus.partition  // similarly here, level n decides if child is a partition
I.e. what would level n/cpuset.cpus.reserve be good for when it can directly control level n+1/cpuset.cpus?
Thanks, Michal
On 5/2/23 14:01, Michal Koutný wrote:
Hello.
The previous thread arrived incomplete to me, so I respond to the last message only. Point me to a message URL if it was covered.
On Fri, Apr 14, 2023 at 03:06:27PM -0400, Waiman Long longman@redhat.com wrote:
Below is a draft of the new cpuset.cpus.reserve cgroupfs file:
cpuset.cpus.reserve
        A read-write multiple values file which exists on all cpuset-enabled cgroups.
It lists the reserved CPUs to be used for the creation of child partitions. See the section on "cpuset.cpus.partition" below for more information on cpuset partition. These reserved CPUs should be a subset of "cpuset.cpus" and will be mutually exclusive of "cpuset.cpus.effective" when used since these reserved CPUs cannot be used by tasks in the current cgroup.
There are two modes for partition CPU reservation - auto or manual. The system starts up in auto mode where "cpuset.cpus.reserve" will be set automatically when valid child partitions are created and users don't need to touch the file at all. This mode has the limitation that the parent of a partition must be a partition root itself. So child partitions have to be created one by one from the cgroup root down.
To enable the creation of a partition down in the hierarchy without requiring the intermediate cgroups to be partition roots,
Why would this be needed? Owning a CPU (a resource) must logically be passed all the way from the root to the target cgroup, i.e. this is expressed by valid partitioning down to the given level.
one
has to turn on the manual reservation mode by writing directly to "cpuset.cpus.reserve" with a value different from its current value. By distributing the reserve CPUs down the cgroup hierarchy to the parent of the target cgroup, this target cgroup can be switched to become a partition root if its "cpuset.cpus" is a subset of the set of valid reserve CPUs in its parent.
level n
 `- level n+1
      cpuset.cpus            // these are actually configured by "owner" of level n
      cpuset.cpus.partition  // similarly here, level n decides if child is a partition
I.e. what would level n/cpuset.cpus.reserve be good for when it can directly control level n+1/cpuset.cpus?
In the new scheme, the available cpus are still directly passed down to a descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated to a partition) have to be exclusive. So what the cpuset.cpus.reserve does is to identify those exclusive CPUs that can be excluded from the effective_cpus of the parent cgroups before they are claimed by a child partition. Currently this is done automatically when a child partition is created off a parent partition root. The new scheme will break it into 2 separate steps without the requirement that the parent of a partition has to be a partition root itself.
Cheers, Longman
claimed by a partition and will be excluded from the effective_cpus of the parent
On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long longman@redhat.com wrote:
In the new scheme, the available cpus are still directly passed down to a descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated to a partition) have to be exclusive. So what the cpuset.cpus.reserve does is to identify those exclusive CPUs that can be excluded from the effective_cpus of the parent cgroups before they are claimed by a child partition. Currently this is done automatically when a child partition is created off a parent partition root. The new scheme will break it into 2 separate steps without the requirement that the parent of a partition has to be a partition root itself.
new scheme
  1st step:
    echo C >p/cpuset.cpus.reserve          # p/cpuset.cpus.effective == A-C   (1)
  2nd step (claim):
    echo C' >p/c/cpuset.cpus               # C'⊆C
    echo root >p/c/cpuset.cpus.partition
current scheme
  1st step (configure):
    echo C >p/c/cpuset.cpus
  2nd step (reserve & claim):
    echo root >p/c/cpuset.cpus.partition   # p/cpuset.cpus.effective == A-C   (2)
As long as p/c is unpopulated, (1) and (2) are equal situations. Why is the (different) two step procedure needed?
Also the relaxation of the requirement that the parent be a partition confuses me -- if the parent is not a partition, i.e. it has no exclusive ownership of CPUs but can still "give" them to children -- is the child partition meant to be exclusive? (IOW can parent siblings reserve some of the same CPUs?)
Thanks, Michal
On 5/2/23 18:27, Michal Koutný wrote:
On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long longman@redhat.com wrote:
In the new scheme, the available cpus are still directly passed down to a descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated to a partition) have to be exclusive. So what the cpuset.cpus.reserve does is to identify those exclusive CPUs that can be excluded from the effective_cpus of the parent cgroups before they are claimed by a child partition. Currently this is done automatically when a child partition is created off a parent partition root. The new scheme will break it into 2 separate steps without the requirement that the parent of a partition has to be a partition root itself.
new scheme
  1st step:
    echo C >p/cpuset.cpus.reserve          # p/cpuset.cpus.effective == A-C   (1)
  2nd step (claim):
    echo C' >p/c/cpuset.cpus               # C'⊆C
    echo root >p/c/cpuset.cpus.partition
It is something like that. However, the current scheme of automatic reservation is also supported, i.e. cpuset.cpus.reserve will be set automatically when the child cgroup becomes a valid partition as long as the cpuset.cpus.reserve file is not written to. This is for backward compatibility.
Once it is written to, automatic mode will end and users have to manually set it afterward.
current scheme
  1st step (configure):
    echo C >p/c/cpuset.cpus
  2nd step (reserve & claim):
    echo root >p/c/cpuset.cpus.partition   # p/cpuset.cpus.effective == A-C   (2)
As long as p/c is unpopulated, (1) and (2) are equal situations. Why is the (different) two step procedure needed?
Also the relaxation of the requirement that the parent be a partition confuses me -- if the parent is not a partition, i.e. it has no exclusive ownership of CPUs but can still "give" them to children -- is the child partition meant to be exclusive? (IOW can parent siblings reserve some of the same CPUs?)
A valid partition root has exclusive ownership of its CPUs. That is a rule that won't be changed. As a result, an incoming partition root cannot claim CPUs that have been allocated to another partition. To simplify things, a transition to a valid partition root is not possible if any of the CPUs in its cpuset.cpus are not in the cpuset.cpus.reserve of its ancestor or have been allocated to another partition. The partition root simply becomes invalid.
The parent can virtually give the reserved CPUs from the root down the hierarchy and a child can claim them once it becomes a partition root. In manual mode, we need to check all the way up the hierarchy to the root to figure out which CPUs in cpuset.cpus.reserve are valid. It has higher overhead, but enabling a partition is not a fast operation anyway.
Cheers, Longman
On Wed, May 03, 2023 at 11:01:36PM -0400, Waiman Long wrote:
On 5/2/23 18:27, Michal Koutný wrote:
On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long longman@redhat.com wrote:
In the new scheme, the available cpus are still directly passed down to a descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated to a partition) have to be exclusive. So what the cpuset.cpus.reserve does is to identify those exclusive CPUs that can be excluded from the effective_cpus of the parent cgroups before they are claimed by a child partition. Currently this is done automatically when a child partition is created off a parent partition root. The new scheme will break it into 2 separate steps without the requirement that the parent of a partition has to be a partition root itself.
new scheme 1st step: echo C >p/cpuset.cpus.reserve # p/cpuset.cpus.effective == A-C (1) 2nd step (claim): echo C' >p/c/cpuset.cpus # C'⊆C echo root >p/c/cpuset.cpus.partition
It is something like that. However, the current scheme of automatic reservation is also supported, i.e. cpuset.cpus.reserve will be set automatically when the child cgroup becomes a valid partition as long as the cpuset.cpus.reserve file is not written to. This is for backward compatibility.
Once it is written to, automatic mode will end and users have to manually set it afterward.
I really don't like the implicit switching behavior. This is interface behavior modifying internal state that userspace can't view or control directly. Regardless of how the rest of the discussion develops, this part should be improved (e.g. would it work to always try to auto-reserve if the cpu isn't already reserved?).
Thanks.
On 5/5/23 12:03, Tejun Heo wrote:
On Wed, May 03, 2023 at 11:01:36PM -0400, Waiman Long wrote:
On 5/2/23 18:27, Michal Koutný wrote:
On Tue, May 02, 2023 at 05:26:17PM -0400, Waiman Long longman@redhat.com wrote:
In the new scheme, the available cpus are still directly passed down to a descendant cgroup. However, isolated CPUs (or more generally CPUs dedicated to a partition) have to be exclusive. So what the cpuset.cpus.reserve does is to identify those exclusive CPUs that can be excluded from the effective_cpus of the parent cgroups before they are claimed by a child partition. Currently this is done automatically when a child partition is created off a parent partition root. The new scheme will break it into 2 separate steps without the requirement that the parent of a partition has to be a partition root itself.
new scheme 1st step: echo C >p/cpuset.cpus.reserve # p/cpuset.cpus.effective == A-C (1) 2nd step (claim): echo C' >p/c/cpuset.cpus # C'⊆C echo root >p/c/cpuset.cpus.partition
It is something like that. However, the current scheme of automatic reservation is also supported, i.e. cpuset.cpus.reserve will be set automatically when the child cgroup becomes a valid partition as long as the cpuset.cpus.reserve file is not written to. This is for backward compatibility.
Once it is written to, automatic mode will end and users have to manually set it afterward.
I really don't like the implicit switching behavior. This is interface behavior modifying internal state that userspace can't view or control directly. Regardless of how the rest of the discussion develops, this part should be improved (e.g. would it work to always try to auto-reserve if the cpu isn't already reserved?).
After some more thought yesterday, I have a slight change in my design: auto-reservation as it is now will stay for partitions that have a partition root parent. For a remote partition that doesn't have a partition root parent, its creation will require pre-allocating additional CPUs into top_cpuset's cpuset.cpus.reserve first. So there will be no change in behavior for existing use cases, whether a remote partition is created or not.
Cheers, Longman
Hi,
The following is the proposed text for "cpuset.cpus.reserve" and "cpuset.cpus.partition" of the new cpuset partition in Documentation/admin-guide/cgroup-v2.rst.
cpuset.cpus.reserve
        A read-write multiple values file which exists only on the root cgroup.
It lists all the CPUs that are reserved for adjacent and remote partitions created in the system. See the next section for more information on what adjacent and remote partitions are.
Creation of an adjacent partition does not require touching this control file as CPU reservation will be done automatically. In order to create a remote partition, the CPUs needed by the remote partition have to be written to this file first.
A "+" prefix can be used to indicate a list of additional CPUs that are to be added without disturbing the CPUs that are originally there. For example, if its current value is "3-4", echoing ""+5" to it will change it to "3-5".
Once a remote partition is destroyed, its CPUs have to be removed from this file or no other process can use them. A "-" prefix can be used to remove a list of CPUs from it. However, removing CPUs that are currently used in existing partitions may cause those partitions to become invalid. A single "-" character without any number can be used to indicate removal of all the free CPUs not allocated to any partitions to avoid accidental partition invalidation.
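For instance, using the prefix syntax above (mount point and CPU numbers are illustrative only):

    echo "+5" > /sys/fs/cgroup/cpuset.cpus.reserve   # "3-4" becomes "3-5"
    echo "-5" > /sys/fs/cgroup/cpuset.cpus.reserve   # back to "3-4"
    echo "-"  > /sys/fs/cgroup/cpuset.cpus.reserve   # drop all free reserve CPUs not allocated to any partition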
cpuset.cpus.partition
        A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable.
It accepts only the following input values when written to.
==========  =====================================
"member"    Non-root member of a partition
"root"      Partition root
"isolated"  Partition root without load balancing
==========  =====================================
A cpuset partition is a collection of cgroups with a partition root at the top of the hierarchy and its descendants except those that are separate partition roots themselves and their descendants. A partition has exclusive access to the set of CPUs allocated to it. Other cgroups outside of that partition cannot use any CPUs in that set.
There are two types of partitions - adjacent and remote. The parent of an adjacent partition must be a valid partition root. Partition roots of adjacent partitions are all clustered around the root cgroup. Creation of an adjacent partition is done by writing the desired partition type into "cpuset.cpus.partition".
A remote partition does not require a partition root parent. So a remote partition can be formed far from the root cgroup. However, its creation is a 2-step process. The CPUs needed by a remote partition ("cpuset.cpus" of the partition root) have to be written into "cpuset.cpus.reserve" of the root cgroup first. After that, "isolated" can be written into "cpuset.cpus.partition" of the partition root to form a remote isolated partition, which is the only supported remote partition type for now.
All remote partitions are terminal as adjacent partitions cannot be created underneath them.
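A hypothetical walk-through of the 2-step remote partition creation described above (the target path and CPU numbers are made up; per the validity rules below, the CPUs must also appear in "cpuset.cpus" of all the target's ancestors):

    # step 1: reserve the CPUs in the root cgroup
    echo "+10-11" > /sys/fs/cgroup/cpuset.cpus.reserve
    # step 2: turn the target cgroup into a remote isolated partition
    echo 10-11 > /sys/fs/cgroup/a/b/c/cpuset.cpus
    echo isolated > /sys/fs/cgroup/a/b/c/cpuset.cpus.partition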
The root cgroup is always a partition root and its state cannot be changed. All other non-root cgroups start out as "member".
When set to "root", the current cgroup is the root of a new partition or scheduling domain.
When set to "isolated", the CPUs in that partition will be in an isolated state without any load balancing from the scheduler. Tasks placed in such a partition with multiple CPUs should be carefully distributed and bound to each of the individual CPUs for optimal performance.
The value shown in "cpuset.cpus.effective" of a partition root is the CPUs that are dedicated to that partition and not available to cgroups outside of that partition.
A partition root ("root" or "isolated") can be in one of the two possible states - valid or invalid. An invalid partition root is in a degraded state where some state information may be retained, but behaves more like a "member".
All possible state transitions among "member", "root" and "isolated" are allowed.
On read, the "cpuset.cpus.partition" file can show the following values.
=============================  =====================================
"member"                       Non-root member of a partition
"root"                         Partition root
"isolated"                     Partition root without load balancing
"root invalid (<reason>)"      Invalid partition root
"isolated invalid (<reason>)"  Invalid isolated partition root
=============================  =====================================
In the case of an invalid partition root, a descriptive string on why the partition is invalid is included within parentheses.
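For example, reading the file of an invalidated isolated partition could show something like the following (the reason string here is purely illustrative):

    isolated invalid (CPUs no longer available for the partition)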
For an adjacent partition root to be valid, the following conditions must be met.
1) The "cpuset.cpus" is exclusive with its siblings , i.e. they are not shared by any of its siblings (exclusivity rule). 2) The parent cgroup is a valid partition root. 3) The "cpuset.cpus" is not empty and must contain at least one of the CPUs from parent's "cpuset.cpus", i.e. they overlap. 4) The "cpuset.cpus.effective" cannot be empty unless there is no task associated with this partition.
For a remote partition root to be valid, the following conditions must be met.
1) The same exclusivity rule as an adjacent partition root.
2) The "cpuset.cpus" is not empty and all the CPUs must be present in "cpuset.cpus.reserve" of the root cgroup and none of them are allocated to another partition.
3) The "cpuset.cpus" value must be present in all its ancestors to ensure proper hierarchical cpu distribution.
External events like hotplug or changes to "cpuset.cpus" can cause a valid partition root to become invalid and vice versa. Note that a task cannot be moved to a cgroup with empty "cpuset.cpus.effective".
For a valid partition root with the sibling cpu exclusivity rule enabled, changes made to "cpuset.cpus" that violate the exclusivity rule will invalidate the partition as well as its sibling partitions with conflicting cpuset.cpus values. So care must be taken when changing "cpuset.cpus".
A valid non-root parent partition may distribute out all its CPUs to its child partitions when there is no task associated with it.
Care must be taken when changing a valid partition root to "member" as all its child partitions, if present, will become invalid, causing disruption to tasks running in those child partitions. These inactivated partitions could be recovered if their parent is switched back to a partition root with a proper set of "cpuset.cpus".
Poll and inotify events are triggered whenever the state of "cpuset.cpus.partition" changes. That includes changes caused by a write to "cpuset.cpus.partition", cpu hotplug or other changes that modify the validity status of the partition. This will allow user space agents to monitor unexpected changes to "cpuset.cpus.partition" without the need to do continuous polling.
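As a sketch of how a user space agent could watch for such events (assuming the inotifywait utility from inotify-tools; the path is illustrative):

    # block until cpuset.cpus.partition changes, then show its new state
    inotifywait -e modify /sys/fs/cgroup/a/b/c/cpuset.cpus.partition
    cat /sys/fs/cgroup/a/b/c/cpuset.cpus.partition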
Cheers, Longman
Hello, Waiman.
On Sun, May 07, 2023 at 09:03:44PM -0400, Waiman Long wrote: ...
cpuset.cpus.reserve
        A read-write multiple values file which exists only on the root cgroup.
It lists all the CPUs that are reserved for adjacent and remote partitions created in the system. See the next section for more information on what adjacent and remote partitions are.
Creation of an adjacent partition does not require touching this control file as CPU reservation will be done automatically. In order to create a remote partition, the CPUs needed by the remote partition have to be written to this file first.
A "+" prefix can be used to indicate a list of additional CPUs that are to be added without disturbing the CPUs that are originally there. For example, if its current value is "3-4", echoing ""+5" to it will change it to "3-5".
Once a remote partition is destroyed, its CPUs have to be removed from this file or no other process can use them. A "-" prefix can be used to remove a list of CPUs from it. However, removing CPUs that are currently used in existing partitions may cause those partitions to become invalid. A single "-" character without any number can be used to indicate removal of all the free CPUs not allocated to any partitions to avoid accidental partition invalidation.
Why is the syntax different from .cpus? Wouldn't it be better to keep them the same?
cpuset.cpus.partition
        A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable.
It accepts only the following input values when written to.
==========  =====================================
"member"    Non-root member of a partition
"root"      Partition root
"isolated"  Partition root without load balancing
==========  =====================================
A cpuset partition is a collection of cgroups with a partition root at the top of the hierarchy and its descendants except those that are separate partition roots themselves and their descendants. A partition has exclusive access to the set of CPUs allocated to it. Other cgroups outside of that partition cannot use any CPUs in that set.
There are two types of partitions - adjacent and remote. The parent of an adjacent partition must be a valid partition root. Partition roots of adjacent partitions are all clustered around the root cgroup. Creation of an adjacent partition is done by writing the desired partition type into "cpuset.cpus.partition".
A remote partition does not require a partition root parent. So a remote partition can be formed far from the root cgroup. However, its creation is a 2-step process. The CPUs needed by a remote partition ("cpuset.cpus" of the partition root) have to be written into "cpuset.cpus.reserve" of the root cgroup first. After that, "isolated" can be written into "cpuset.cpus.partition" of the partition root to form a remote isolated partition, which is the only supported remote partition type for now.
All remote partitions are terminal as adjacent partitions cannot be created underneath them.
Can you elaborate this extra restriction a bit further?
In general, I think it'd be really helpful if the document explains the reasoning behind the design decisions, i.e. what is reserving for? What purpose does it serve that the regular isolated ones cannot? That'd help clarify the design decisions.
Thanks.
On 5/22/23 15:49, Tejun Heo wrote:
Hello, Waiman.
Sorry for the late reply as I had been off for almost 2 weeks due to PTO.
On Sun, May 07, 2023 at 09:03:44PM -0400, Waiman Long wrote: ...
cpuset.cpus.reserve
        A read-write multiple values file which exists only on the root cgroup.
It lists all the CPUs that are reserved for adjacent and remote partitions created in the system. See the next section for more information on what adjacent and remote partitions are.
Creation of an adjacent partition does not require touching this control file as CPU reservation will be done automatically. In order to create a remote partition, the CPUs needed by the remote partition have to be written to this file first.
A "+" prefix can be used to indicate a list of additional CPUs that are to be added without disturbing the CPUs that are originally there. For example, if its current value is "3-4", echoing ""+5" to it will change it to "3-5".
Once a remote partition is destroyed, its CPUs have to be removed from this file or no other process can use them. A "-" prefix can be used to remove a list of CPUs from it. However, removing CPUs that are currently used in existing partitions may cause those partitions to become invalid. A single "-" character without any number can be used to indicate removal of all the free CPUs not allocated to any partitions to avoid accidental partition invalidation.
Why is the syntax different from .cpus? Wouldn't it be better to keep them the same?
Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs that are used in multiple partitions. Also automatic reservation of adjacent partitions can happen in parallel. That is why I think it will be safer if we allow incremental increase or decrease of reserve CPUs to be used for remote partitions. I will include this reasoning into the doc file.
cpuset.cpus.partition
        A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable.
It accepts only the following input values when written to.
==========  =====================================
"member"    Non-root member of a partition
"root"      Partition root
"isolated"  Partition root without load balancing
==========  =====================================
A cpuset partition is a collection of cgroups with a partition root at the top of the hierarchy and its descendants except those that are separate partition roots themselves and their descendants. A partition has exclusive access to the set of CPUs allocated to it. Other cgroups outside of that partition cannot use any CPUs in that set.
There are two types of partitions - adjacent and remote. The parent of an adjacent partition must be a valid partition root. Partition roots of adjacent partitions are all clustered around the root cgroup. Creation of an adjacent partition is done by writing the desired partition type into "cpuset.cpus.partition".
A remote partition does not require a partition root parent. So a remote partition can be formed far from the root cgroup. However, its creation is a 2-step process. The CPUs needed by a remote partition ("cpuset.cpus" of the partition root) have to be written into "cpuset.cpus.reserve" of the root cgroup first. After that, "isolated" can be written into "cpuset.cpus.partition" of the partition root to form a remote isolated partition, which is the only supported remote partition type for now.
All remote partitions are terminal as adjacent partitions cannot be created underneath them.
Can you elaborate this extra restriction a bit further?
Are you referring to the fact that only remote isolated partitions are supported? I do not preclude the support of load balancing remote partitions. I keep it to isolated partitions for now for ease of implementation and I am not currently aware of a use case where such a remote partition type is needed.
If you are talking about a remote partition being terminal, it is mainly because it can be more tricky to support hierarchical adjacent partitions underneath it, especially if it is not isolated. We can certainly support it if a use case arises. I just don't want to implement code that nobody is really going to use.
BTW, with the current way the remote partition is created, it is not possible to have another remote partition underneath it.
In general, I think it'd be really helpful if the document explains the reasoning behind the design decisions, i.e. what is reserving for? What purpose does it serve that the regular isolated ones cannot? That'd help clarify the design decisions.
I understand your concern. If you think it is better to support both types of remote partitions or hierarchical adjacent partitions underneath it for symmetry purposes, I can certainly do that. It will just take a bit more time.
Cheers, Longman
Hello, Waiman.
On Sun, May 28, 2023 at 05:18:50PM -0400, Waiman Long wrote:
On 5/22/23 15:49, Tejun Heo wrote: Sorry for the late reply as I had been off for almost 2 weeks due to PTO.
And me too. Just moved.
Why is the syntax different from .cpus? Wouldn't it be better to keep them the same?
Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs that are used in multiple partitions. Also automatic reservation of adjacent partitions can happen in parallel. That is why I think it will be safer if
Ah, I see, this is because cpu.reserve is only in the root cgroup, so you can't say that the knob is owned by the parent cgroup and thus access is controlled that way.
...
There are two types of partitions - adjacent and remote. The parent of an adjacent partition must be a valid partition root. Partition roots of adjacent partitions are all clustered around the root cgroup. Creation of an adjacent partition is done by writing the desired partition type into "cpuset.cpus.partition".
A remote partition does not require a partition root parent. So a remote partition can be formed far from the root cgroup. However, its creation is a 2-step process. The CPUs needed by a remote partition ("cpuset.cpus" of the partition root) have to be written into "cpuset.cpus.reserve" of the root cgroup first. After that, "isolated" can be written into "cpuset.cpus.partition" of the partition root to form a remote isolated partition, which is the only supported remote partition type for now.
All remote partitions are terminal as adjacent partitions cannot be created underneath them.
Can you elaborate this extra restriction a bit further?
Are you referring to the fact that only remote isolated partitions are supported? I do not preclude the support of load balancing remote partitions. I keep it to isolated partitions for now for ease of implementation and I am not currently aware of a use case where such a remote partition type is needed.
If you are talking about a remote partition being terminal, it is mainly because it can be more tricky to support hierarchical adjacent partitions underneath it, especially if it is not isolated. We can certainly support it if a use case arises. I just don't want to implement code that nobody is really going to use.
BTW, with the current way the remote partition is created, it is not possible to have another remote partition underneath it.
The fact that the control is spread across a root-only file and a per-cgroup file seems hacky to me. e.g. How would it interact with namespacing? Are there reasons why this can't be properly hierarchical other than the amount of work needed? For example:
cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs that the cgroup holds exclusively. The mask is always a subset of cpuset.cpus. The parent loses access to a CPU when the CPU is given to a child by setting the CPU in the child's cpus.exclusive and the CPU can't be given to more than one child. IOW, exclusive CPUs are available only to the leaf cgroups that have them set in their .exclusive file.
When a cgroup is turned into a partition, its cpuset.cpus and cpuset.cpus.exclusive should be the same. For backward compatibility, if the cgroup's parent is already a partition, cpuset will automatically attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
I could well be missing something important but I'd really like to see something like the above where the reservation feature blends in with the rest of cpuset.
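For concreteness, a rough sketch of how this could look (cgroup names and CPU numbers are made up and the exact semantics are still up for discussion):

    # give cpus 8-9 exclusively to the child; the parent loses access to them
    echo 8-9 > parent/child/cpuset.cpus
    echo 8-9 > parent/child/cpuset.cpus.exclusive
    # with cpuset.cpus == cpuset.cpus.exclusive, the child can become a partition
    echo isolated > parent/child/cpuset.cpus.partition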
Thanks.
On 6/5/23 14:03, Tejun Heo wrote:
Hello, Waiman.
On Sun, May 28, 2023 at 05:18:50PM -0400, Waiman Long wrote:
On 5/22/23 15:49, Tejun Heo wrote: Sorry for the late reply as I had been off for almost 2 weeks due to PTO.
And me too. Just moved.
Why is the syntax different from .cpus? Wouldn't it be better to keep them the same?
Unlike cpuset.cpus, cpuset.cpus.reserve is supposed to contain CPUs that are used in multiple partitions. Also automatic reservation of adjacent partitions can happen in parallel. That is why I think it will be safer if
Ah, I see, this is because cpu.reserve is only in the root cgroup, so you can't say that the knob is owned by the parent cgroup and thus access is controlled that way.
...
There are two types of partitions - adjacent and remote. The parent of an adjacent partition must be a valid partition root. Partition roots of adjacent partitions are all clustered around the root cgroup. Creation of an adjacent partition is done by writing the desired partition type into "cpuset.cpus.partition".
A remote partition does not require a partition root parent. So a remote partition can be formed far from the root cgroup. However, its creation is a 2-step process. The CPUs needed by a remote partition ("cpuset.cpus" of the partition root) have to be written into "cpuset.cpus.reserve" of the root cgroup first. After that, "isolated" can be written into "cpuset.cpus.partition" of the partition root to form a remote isolated partition, which is the only supported remote partition type for now.
All remote partitions are terminal as adjacent partitions cannot be created underneath them.
Can you elaborate this extra restriction a bit further?
Are you referring to the fact that only remote isolated partitions are supported? I do not preclude the support of load balancing remote partitions. I keep it to isolated partitions for now for ease of implementation and I am not currently aware of a use case where such a remote partition type is needed.
If you are talking about a remote partition being terminal, it is mainly because it can be more tricky to support hierarchical adjacent partitions underneath it, especially if it is not isolated. We can certainly support it if a use case arises. I just don't want to implement code that nobody is really going to use.
BTW, with the current way the remote partition is created, it is not possible to have another remote partition underneath it.
The fact that the control is spread across a root-only file and a per-cgroup file seems hacky to me. e.g. How would it interact with namespacing? Are there reasons why this can't be properly hierarchical other than the amount of work needed? For example:
cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs that the cgroup holds exclusively. The mask is always a subset of cpuset.cpus. The parent loses access to a CPU when the CPU is given to a child by setting the CPU in the child's cpus.exclusive and the CPU can't be given to more than one child. IOW, exclusive CPUs are available only to the leaf cgroups that have them set in their .exclusive file.
When a cgroup is turned into a partition, its cpuset.cpus and cpuset.cpus.exclusive should be the same. For backward compatibility, if the cgroup's parent is already a partition, cpuset will automatically attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
I could well be missing something important but I'd really like to see something like the above where the reservation feature blends in with the rest of cpuset.
It can certainly be made hierarchical as you suggest. It does increase complexity from both user and kernel point of view.
From the user point of view, there is one more knob to manage hierarchically which is not used that often.
From the kernel point of view, we may need to have one more cpumask per cpuset, as the current subparts_cpus is used to track automatic reservation. We need another cpumask to contain extra exclusive CPUs not allocated through automatic reservation. You describe this new control file as a list of exclusively owned CPUs for this cgroup, and creating a partition is in fact allocating exclusive CPUs to a cgroup. So it kind of overlaps with the cpuset.cpus.partition file. Can we fail a write to cpuset.cpus.exclusive if those exclusive CPUs cannot be granted, or will the exclusive list only be valid if a valid partition can be formed? So we need to properly manage the dependency between these 2 control files.
Alternatively, I have no problem exposing cpuset.cpus.exclusive as a read-only file. It is a bit problematic if we need to make it writable.
As for namespacing, you do raise a good point. I was thinking mostly from a whole-system point of view, as the use case that I am aware of does not need that. To allow delegation of exclusive CPUs to a child cgroup, that cgroup has to be a partition root itself. One compromise that I can think of is to allow automatic reservation only in such a scenario. In that case, I need to support a remote load-balanced partition as well, and hierarchical sub-partitions underneath it. That can be done with some extra code on top of the existing v2 patchset without introducing too much complexity.
IOW, the use of remote partitions is only allowed at the whole-system level, where one has access to the cgroup root. Exclusive CPU distribution within a container can only be done via adjacent partitions with automatic reservation. Will that be a good enough compromise from your point of view?
Cheers, Longman
Hello,
On Mon, Jun 05, 2023 at 04:00:39PM -0400, Waiman Long wrote: ...
file seems hacky to me. e.g. How would it interact with namespacing? Are there reasons why this can't be properly hierarchical other than the amount of work needed? For example:
cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs that the cgroup holds exclusively. The mask is always a subset of cpuset.cpus. The parent loses access to a CPU when the CPU is given to a child by setting the CPU in the child's cpus.exclusive and the CPU can't be given to more than one child. IOW, exclusive CPUs are available only to the leaf cgroups that have them set in their .exclusive file.
When a cgroup is turned into a partition, its cpuset.cpus and cpuset.cpus.exclusive should be the same. For backward compatibility, if the cgroup's parent is already a partition, cpuset will automatically attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
I could well be missing something important but I'd really like to see something like the above where the reservation feature blends in with the rest of cpuset.
It can certainly be made hierarchical as you suggest. It does increase complexity from both user and kernel point of view.
From the user point of view, there is one more knob to manage hierarchically which is not used that often.
From user pov, this only affects them when they want to create partitions down the tree, right?
From the kernel point of view, we may need to have one more cpumask per cpuset, as the current subparts_cpus is used to track automatic reservation; we need another cpumask to contain extra exclusive CPUs not allocated through automatic reservation. You describe this new control file as a list of exclusively owned CPUs for this cgroup, and creating a partition is in fact allocating exclusive CPUs to a cgroup, so it kind of overlaps with the cpuset.cpus.partition file. Can we fail a write to
Yes, it substitutes and expands on cpuset.cpus.partition behavior.
cpuset.cpus.exclusive if those exclusive CPUs cannot be granted, or is the exclusive list only valid once a valid partition can be formed? So we need to properly manage the dependency between these 2 control files.
So, I think cpus.exclusive can become the sole mechanism to arbitrate exclusive ownership of CPUs and .partition can depend on .exclusive.
Alternatively, I have no problem exposing cpuset.cpus.exclusive as a read-only file. It is a bit problematic if we need to make it writable.
I don't follow. How would remote partitions work then?
As for namespacing, you do raise a good point. I was thinking mostly from a whole-system point of view, as the use case that I am aware of does not need that. To allow delegation of exclusive CPUs to a child cgroup, that cgroup has to be a partition root itself. One compromise that I can think of is to allow automatic reservation only in such a scenario. In that case, I need to support a remote load-balanced partition as well, and hierarchical sub-partitions underneath it. That can be done with some extra code on top of the existing v2 patchset without introducing too much complexity.
IOW, the use of remote partition is only allowed on the whole system level where one has access to the cgroup root. Exclusive CPUs distribution within a container can only be done via the use of adjacent partitions with automatic reservation. Will that be a good enough compromise from your point of view?
It seems too twisted to me. I'd much prefer it to be better integrated with the rest of cpuset.
Thanks.
On 6/5/23 16:27, Tejun Heo wrote:
Hello,
On Mon, Jun 05, 2023 at 04:00:39PM -0400, Waiman Long wrote: ...
file seems hacky to me. e.g. How would it interact with namespacing? Are there reasons why this can't be properly hierarchical other than the amount of work needed? For example:
cpuset.cpus.exclusive is a per-cgroup file and represents the mask of CPUs that the cgroup holds exclusively. The mask is always a subset of cpuset.cpus. The parent loses access to a CPU when the CPU is given to a child by setting the CPU in the child's cpus.exclusive and the CPU can't be given to more than one child. IOW, exclusive CPUs are available only to the leaf cgroups that have them set in their .exclusive file. When a cgroup is turned into a partition, its cpuset.cpus and cpuset.cpus.exclusive should be the same. For backward compatibility, if the cgroup's parent is already a partition, cpuset will automatically attempt to add all cpus in cpuset.cpus into cpuset.cpus.exclusive.
I could well be missing something important but I'd really like to see something like the above where the reservation feature blends in with the rest of cpuset.
It can certainly be made hierarchical as you suggest. It does increase complexity from both user and kernel point of view.
From the user point of view, there is one more knob to manage hierarchically which is not used that often.
From user pov, this only affects them when they want to create partitions down the tree, right?
From the kernel point of view, we may need to have one more cpumask per cpuset, as the current subparts_cpus is used to track automatic reservation; we need another cpumask to contain extra exclusive CPUs not allocated through automatic reservation. You describe this new control file as a list of exclusively owned CPUs for this cgroup, and creating a partition is in fact allocating exclusive CPUs to a cgroup, so it kind of overlaps with the cpuset.cpus.partition file. Can we fail a write to
Yes, it substitutes and expands on cpuset.cpus.partition behavior.
cpuset.cpus.exclusive if those exclusive CPUs cannot be granted, or is the exclusive list only valid once a valid partition can be formed? So we need to properly manage the dependency between these 2 control files.
So, I think cpus.exclusive can become the sole mechanism to arbitrate exclusive ownership of CPUs and .partition can depend on .exclusive.
Alternatively, I have no problem exposing cpuset.cpus.exclusive as a read-only file. It is a bit problematic if we need to make it writable.
I don't follow. How would remote partitions work then?
I had a different idea about the semantics of cpuset.cpus.exclusive at the beginning. My original thinking was that it holds the actual exclusive CPUs that are allocated to the cgroup. Now, if we treat this as a hint of which exclusive CPUs should be used, and it becomes valid only if the cgroup can become a valid partition, I can see it as a value that can be set hierarchically throughout the whole cpuset hierarchy.
So a transition to a valid partition is possible iff
1) cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of cpuset.cpus.exclusive of all its ancestors.
2) If its parent is not a partition root, none of the CPUs in cpuset.cpus.exclusive are currently allocated to other partitions. This is the same remote partition concept as in my v2 patch. If its parent is a partition root, part of its exclusive CPUs will be distributed to this child partition, like the current behavior of cpuset partitions.
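A small sketch of these two conditions, again with Python sets in place of cpumasks; the Node structure and names are illustrative, not the kernel data structures:

    from dataclasses import dataclass, field
    from typing import Optional, Set

    @dataclass
    class Node:                       # illustrative stand-in for a cpuset
        cpus: Set[int]
        exclusive: Set[int] = field(default_factory=set)
        is_partition_root: bool = False
        parent: Optional["Node"] = None

    def can_become_partition(node: Node, cpus_in_other_partitions: Set[int]) -> bool:
        # 1) exclusive must be a subset of cpus and of every ancestor's exclusive.
        if not node.exclusive <= node.cpus:
            return False
        anc = node.parent
        while anc is not None:
            if not node.exclusive <= anc.exclusive:
                return False
            anc = anc.parent
        # 2) Remote case: parent is not a partition root, so none of the
        #    exclusive CPUs may already belong to another partition.
        if node.parent is not None and not node.parent.is_partition_root:
            return not (node.exclusive & cpus_in_other_partitions)
        # Adjacent case: the parent partition hands part of its exclusive
        # CPUs down to this child, as with the existing behavior.
        return True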
I can rework my patch to adopt this model if it is what you have in mind.
Thanks, Longman
Hello, Waiman.
On Mon, Jun 05, 2023 at 10:47:08PM -0400, Waiman Long wrote: ...
I had a different idea on the semantics of the cpuset.cpus.exclusive at the beginning. My original thinking is that it was the actual exclusive CPUs that are allocated to the cgroup. Now if we treat this as a hint of what exclusive CPUs should be used and it becomes valid only if the cgroup can
I wouldn't call it a hint. It's still a hard allocation of the CPUs to the cgroups that own them. Setting up a partition requires exclusive CPUs and thus would depend on exclusive allocations being set up accordingly.
become a valid partition. I can see it as a value that can be hierarchically set throughout the whole cpuset hierarchy.
So a transition to a valid partition is possible iff
- cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of
cpuset.cpus.exclusive of all its ancestors.
Yes.
- If its parent is not a partition root, none of the CPUs in
cpuset.cpus.exclusive are currently allocated to other partitions. This is the
Not just that, the CPUs aren't available to cgroups which don't have them set in the .exclusive file. IOW, if a CPU is in cpus.exclusive of some cgroup, it shouldn't appear in cpus.effective of cgroups which don't have the CPU in their cpus.exclusive.
So, .exclusive explicitly establishes exclusive ownership of CPUs and partitions depend on that with an implicit "turn CPUs exclusive" behavior in case the parent is a partition root for backward compatibility.
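A brief sketch of that filtering rule with plain sets; the function and parameter names are illustrative:

    # A CPU present in any cgroup's cpus.exclusive is hidden from the
    # cpus.effective of every cgroup that doesn't list it in its own
    # cpus.exclusive.
    def effective_cpus(cpus, own_exclusive, all_exclusive_cpus):
        foreign_exclusive = all_exclusive_cpus - own_exclusive
        return cpus - foreign_exclusive

    # e.g. CPUs 2-3 owned exclusively elsewhere disappear from a non-owner:
    print(effective_cpus({0, 1, 2, 3}, set(), {2, 3}))   # {0, 1}
    print(effective_cpus({2, 3}, {2, 3}, {2, 3}))        # {2, 3}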
same remote partition concept as in my v2 patch. If its parent is a partition root, part of its exclusive CPUs will be distributed to this child partition, like the current behavior of cpuset partitions.
Yes, similar in a sense. Please do away with the "once .reserve is used, the behavior is switched" part. Instead, it can be something like "if the parent is a partition root, cpuset implicitly tries to set all CPUs in its cpus file in its cpus.exclusive file" so that the user-visible behavior doesn't change depending on past history.
Thanks.
On 6/6/23 15:58, Tejun Heo wrote:
Hello, Waiman.
On Mon, Jun 05, 2023 at 10:47:08PM -0400, Waiman Long wrote: ...
I had a different idea on the semantics of the cpuset.cpus.exclusive at the beginning. My original thinking is that it was the actual exclusive CPUs that are allocated to the cgroup. Now if we treat this as a hint of what exclusive CPUs should be used and it becomes valid only if the cgroup can
I wouldn't call it a hint. It's still hard allocation of the CPUs to the cgroups that own them. Setting up a partition requires exclusive CPUs and thus would depend on exclusive allocations set up accordingly.
become a valid partition. I can see it as a value that can be hierarchically set throughout the whole cpuset hierarchy.
So a transition to a valid partition is possible iff
- cpuset.cpus.exclusive is a subset of cpuset.cpus and is a subset of
cpuset.cpus.exclusive of all its ancestors.
Yes.
- If its parent is not a partition root, none of the CPUs in
cpuset.cpus.exclusive are currently allocated to other partitions. This is the
Not just that, the CPUs aren't available to cgroups which don't have them set in the .exclusive file. IOW, if a CPU is in cpus.exclusive of some cgroups, it shouldn't appear in cpus.effective of cgroups which don't have the CPU in their cpus.exclusive.
So, .exclusive explicitly establishes exclusive ownership of CPUs and partitions depend on that with an implicit "turn CPUs exclusive" behavior in case the parent is a partition root for backward compatibility.
The current CPU exclusive behavior is limited to sibling cgroups only. Because of the hierarchical nature of CPU distribution, the set of exclusive CPUs has to appear in all its ancestors. When a partition is enabled, we do a sibling exclusivity test at that point to verify that it is exclusive. It looks like you want to do an exclusivity test even when the partition isn't active. I can certainly do that when the file is being updated. However, it will fail the write if the exclusivity test fails, just like the v1 cpuset.cpus.exclusive flag, if you are OK with that.
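A sketch of what such a write-time check could look like, with dicts standing in for cpusets; this is illustrative, not the kernel code:

    import errno

    # Reject a write to cpuset.cpus.exclusive if the requested CPUs are not a
    # subset of the cgroup's cpus or overlap a sibling's exclusive CPUs,
    # similar to how the v1 exclusive flag fails such configurations.
    def write_cpus_exclusive(cgroup, requested, siblings):
        requested = set(requested)
        if not requested <= cgroup["cpus"]:
            return -errno.EINVAL
        for sib in siblings:
            if requested & sib["exclusive"]:
                return -errno.EINVAL        # exclusivity test fails the write
        cgroup["exclusive"] = requested
        return 0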
same remote partition concept as in my v2 patch. If its parent is a partition root, part of its exclusive CPUs will be distributed to this child partition, like the current behavior of cpuset partitions.
Yes, similar in a sense. Please do away with the "once .reserve is used, the behavior is switched" part.
That behavior is already gone in my v2 patch.
Instead, it can be something like "if the parent is a partition root, cpuset implicitly tries to set all CPUs in its cpus file in its cpus.exclusive file" so that the user-visible behavior doesn't change depending on past history.
If the parent is a partition root, auto reservation will be done and cpus.exclusive will be set automatically, just like before. So existing applications using partitions will not be affected.
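A tiny sketch of that backward-compatible path; the structure and names are illustrative only:

    from dataclasses import dataclass, field

    @dataclass
    class Cs:                              # illustrative stand-in for a cpuset
        cpus: set
        exclusive: set = field(default_factory=set)
        is_partition_root: bool = False
        parent: "Cs" = None

    def enable_partition(child: Cs) -> None:
        # If the parent is already a partition root, implicitly reserve the
        # child's cpus as its exclusive CPUs (automatic reservation), so
        # existing users of cpuset.cpus.partition see no behavior change.
        if child.parent is not None and child.parent.is_partition_root:
            child.exclusive = set(child.cpus)
        child.is_partition_root = True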
Cheers, Longman
Hello,
On Tue, Jun 06, 2023 at 04:11:02PM -0400, Waiman Long wrote: ...
The current CPU exclusive behavior is limited to sibling cgroups only. Because of the hierarchical nature of cpu distribution, the set of exclusive CPUs have to appear in all its ancestors. When partition is enabled, we do a sibling exclusivity test at that point to verify that it is exclusive. It looks like you want to do an exclusivity test even when the partition isn't active. I can certainly do that when the file is being updated. However, it will fail the write if the exclusivity test fails just like the v1 cpuset.cpus.exclusive flag if you are OK with that.
Yeah, doesn't look like there's a way around it if we want to make .exclusive a feature which is useful on its own.
Instead, it can be sth like "if the parent is a partition root, cpuset implicitly tries to set all CPUs in its cpus file in its cpus.exclusive file" so that user-visible behavior stays unchanged depending on past history.
If parent is a partition root, auto reservation will be done and cpus.exclusive will be set automatically just like before. So existing applications using partition will not be affected.
Sounds great.
Thanks.