This patch series includes miscellaneous updates to cpuset and its testing code.
Patch 2 is actually a follow-up of commit 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks").
Patches 3-4 are for handling corner cases when dealing with task_cpu_possible_mask().
Waiman Long (5):
  cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset
  cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated
  cgroup/cpuset: Find another usable CPU if none found in current cpuset
  cgroup/cpuset: Add CONFIG_DEBUG_CPUSETS config for cpuset testing
  cgroup/cpuset: Minor updates to test_cpuset_prs.sh
 init/Kconfig                                    |   5 +
 kernel/cgroup/cpuset.c                          | 155 +++++++++++++++++-
 .../selftests/cgroup/test_cpuset_prs.sh         |  25 +--
 3 files changed, 165 insertions(+), 20 deletions(-)
If a hotplug event doesn't affect the current cpuset, there is no point in calling hotplug_update_tasks() or hotplug_update_tasks_legacy(). So just skip it.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 636f1c682ac0..a801abad3bac 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3508,6 +3508,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 update_tasks:
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
+	if (!cpus_updated && !mems_updated)
+		goto unlock;	/* Hotplug doesn't affect this cpuset */
 
 	if (mems_updated)
 		check_insane_mems_config(&new_mems);
@@ -3519,6 +3521,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);
 
+unlock:
 	percpu_up_write(&cpuset_rwsem);
 }
On Mon, Mar 06, 2023 at 03:08:45PM -0500, Waiman Long longman@redhat.com wrote:
If a hotplug event doesn't affect the current cpuset, there is no point to call hotplug_update_tasks() or hotplug_update_tasks_legacy(). So just skip it.
This skips "insane" modification of cs->cpus_allowed in hotplug_update_tasks_legacy() but assuming cs->cpus_allowed is kept in sync with cs->effective_cpus on v1, it is OK to skip the update based only on effective_cpus check.
Hence,
 kernel/cgroup/cpuset.c | 3 +++
 1 file changed, 3 insertions(+)
Reviewed-by: Michal Koutný <mkoutny@suse.com>
On 3/14/23 12:50, Michal Koutný wrote:
On Mon, Mar 06, 2023 at 03:08:45PM -0500, Waiman Long longman@redhat.com wrote:
If a hotplug event doesn't affect the current cpuset, there is no point to call hotplug_update_tasks() or hotplug_update_tasks_legacy(). So just skip it.
This skips "insane" modification of cs->cpus_allowed in hotplug_update_tasks_legacy() but assuming cs->cpus_allowed is kept in sync with cs->effective_cpus on v1, it is OK to skip the update based only on effective_cpus check.
Yes, effective_cpus is equivalent to cpus_allowed in v1 unless you mount the cpuset filesystem with the cpuset_v2_mode flag, which makes it behave more like v2 where effective_cpus is still the key.
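For reference, a rough sketch of that v1 mount (the mount point and fs source name here are just examples):

	# cgroup v1 cpuset hierarchy mounted with the v2-mode flag
	mkdir -p /sys/fs/cgroup/cpuset
	mount -t cgroup -o cpuset,cpuset_v2_mode cpuset /sys/fs/cgroup/cpuset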
Cheers, Longman
Similar to commit 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks"), the whole set of possible CPUs including offline ones should be used for setting cpumasks for tasks in the top cpuset when a cpuset partition is modified.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a801abad3bac..bbf57dcb2f68 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1209,7 +1209,8 @@ void rebuild_sched_domains(void)
  *
  * Iterate through each task of @cs updating its cpus_allowed to the
  * effective cpuset's.  As this function is called with cpuset_rwsem held,
- * cpuset membership stays stable.
+ * cpuset membership stays stable. For top_cpuset, task_cpu_possible_mask()
+ * is used instead of effective_cpus.
  */
 static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
 {
@@ -1219,15 +1220,18 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
 
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it))) {
-		/*
-		 * Percpu kthreads in top_cpuset are ignored
-		 */
-		if (top_cs && (task->flags & PF_KTHREAD) &&
-		    kthread_is_per_cpu(task))
-			continue;
+		const struct cpumask *possible_mask = task_cpu_possible_mask(task);
 
-		cpumask_and(new_cpus, cs->effective_cpus,
-			    task_cpu_possible_mask(task));
+		if (top_cs) {
+			/*
+			 * Percpu kthreads in top_cpuset are ignored
+			 */
+			if ((task->flags & PF_KTHREAD) && kthread_is_per_cpu(task))
+				continue;
+			cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
+		} else {
+			cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
+		}
 		set_cpus_allowed_ptr(task, new_cpus);
 	}
 	css_task_iter_end(&it);
Hello Waiman.
On Mon, Mar 06, 2023 at 03:08:46PM -0500, Waiman Long longman@redhat.com wrote:
/*
* Percpu kthreads in top_cpuset are ignored
*/
if (top_cs && (task->flags & PF_KTHREAD) &&
kthread_is_per_cpu(task))
continue;
const struct cpumask *possible_mask = task_cpu_possible_mask(task);
cpumask_and(new_cpus, cs->effective_cpus,
task_cpu_possible_mask(task));
if (top_cs) {
/*
* Percpu kthreads in top_cpuset are ignored
*/
if ((task->flags & PF_KTHREAD) && kthread_is_per_cpu(task))
continue;
cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
} else {
cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
}
I'm wrapping my head around this slightly.

1) I'd suggest swapping the args of cpumask_and() to have possible_mask consistently first.

2) Then I'm wondering whether the two branches are truly different when

	effective_cpus := cpus_allowed - subparts_cpus
	top_cpuset.cpus_allowed == possible_mask	(1)
IOW, can you see a difference in what affinities are set to eligible top_cpuset tasks before and after this patch upon CPU hotplug? (Hm, (1) holds only in v2. So is this a fix for v1 only?)
Thanks, Michal
On 3/14/23 13:34, Michal Koutný wrote:
Hello Waiman.
On Mon, Mar 06, 2023 at 03:08:46PM -0500, Waiman Long longman@redhat.com wrote:
/*
* Percpu kthreads in top_cpuset are ignored
*/
if (top_cs && (task->flags & PF_KTHREAD) &&
kthread_is_per_cpu(task))
continue;
const struct cpumask *possible_mask = task_cpu_possible_mask(task);
cpumask_and(new_cpus, cs->effective_cpus,
task_cpu_possible_mask(task));
if (top_cs) {
/*
* Percpu kthreads in top_cpuset are ignored
*/
if ((task->flags & PF_KTHREAD) && kthread_is_per_cpu(task))
continue;
cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
} else {
cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
}
I'm wrapping my head around this slightly.
- I'd suggest swapping the args of cpumask_and() to have possible_mask consistently first.
I don't quite understand what you meant by "swapping args". It is effectively new_cpus = cs->effective_cpus ∩ possible_mask. What is the point of swapping cs->effective_cpus and possible_mask?
- Then I'm wondering whether the two branches are truly different when

	effective_cpus := cpus_allowed - subparts_cpus
	top_cpuset.cpus_allowed == possible_mask	(1)
effective_cpus may not be equal to "cpus_allowed - subparts_cpus" if some of the CPUs are offline, as effective_cpus contains only online CPUs. subparts_cpus can include offline CPUs too. That is why I chose that expression. I will add a comment to clarify that.
IOW, can you see a difference in what affinities are set to eligible top_cpuset tasks before and after this patch upon CPU hotplug? (Hm, (1) holds only in v2. So is this a fix for v1 only?)
This is due to the fact that cpu hotplug code currently doesn't update the cpu affinity of tasks in the top cpuset. Tasks not in the top cpuset can rely on the hotplug code to update the cpu affinity appropriately. For the tasks in the top cpuset, we have to make sure that all the offline CPUs are included.
Cheers, Longman
On Tue, Mar 14, 2023 at 03:02:53PM -0400, Waiman Long longman@redhat.com wrote:
cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
} else {
cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
}
I'm wrapping my head around this slightly.
- I'd suggest swapping the args of cpumask_and() to have possible_mask consistently first.
I don't quite understand what you meant by "swapping args". It is effectively new_cpus = cs->effective_cpus ∩ possible_mask. What is the point of swapping cs->effective_cpus and possible_mask?
No effect except better readability (possible_mask comes first in the other branch above too, left hand arg as the "base" that's modified).
- Then I'm wondering whether the two branches are truly different when

	effective_cpus := cpus_allowed - subparts_cpus
	top_cpuset.cpus_allowed == possible_mask	(1)
effective_cpus may not be equal to "cpus_allowed - subparts_cpus" if some of the CPUs are offline, as effective_cpus contains only online CPUs. subparts_cpus can include offline CPUs too. That is why I chose that expression. I will add a comment to clarify that.
I see now that it returns offlined cpus to top cpuset's tasks.
IOW, can you see a difference in what affinities are set to eligible top_cpuset tasks before and after this patch upon CPU hotplug? (Hm, (1) holds only in v2. So is this a fix for v1 only?)
This is due to the fact that cpu hotplug code currently doesn't update the cpu affinity of tasks in the top cpuset. Tasks not in the top cpuset can rely on the hotplug code to update the cpu affinity appropriately.
Oh, I mistook this for hotplug changing behavior, but it's actually about updating top_cpuset tasks when one of its children becomes a partition root.
IIUC, top cpuset + hotplug has been treated specially because hotplug must have taken care of affinity regardless of cpuset. v1 allowed modification of the top cpuset's mask (not sure if that worked); v2 won't allow direct modification of the top cpuset's mask, only indirect modification via partition root children.
So this is a continuation of 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks") to ensure a hotplug offline/online cycle won't overwrite top_cpuset tasks' affinities (when partitions change during the offlined period). This looks correct in this regard then. (I wish it were simpler but that's for a different/broader top cpuset+hotplug approach.)
Thanks, Michal
On 3/15/23 06:06, Michal Koutný wrote:
On Tue, Mar 14, 2023 at 03:02:53PM -0400, Waiman Long longman@redhat.com wrote:
cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
} else {
cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
}
I'm wrapping my head around this slightly.
- I'd suggest swapping the args of cpumask_and() to have possible_mask consistently first.
I don't quite understand what you meant by "swapping args". It is effectively new_cpus = cs->effective_cpus ∩ possible_mask. What is the point of swapping cs->effective_cpus and possible_mask?
No effect except better readability (possible_mask comes first in the other branch above too, left hand arg as the "base" that's modified).
OK, I got it now. I will swap it as suggested.
- Then I'm wondering whether the two branches are truly different when

	effective_cpus := cpus_allowed - subparts_cpus
	top_cpuset.cpus_allowed == possible_mask	(1)
effective_cpus may not be equal to "cpus_allowed - subparts_cpus" if some of the CPUs are offline, as effective_cpus contains only online CPUs. subparts_cpus can include offline CPUs too. That is why I chose that expression. I will add a comment to clarify that.
I see now that it returns offlined cpus to top cpuset's tasks.
IOW, can you see a difference in what affinities are set to eligible top_cpuset tasks before and after this patch upon CPU hotplug? (Hm, (1) holds only in v2. So is this a fix for v1 only?)
This is due to the fact that cpu hotplug code currently doesn't update the cpu affinity of tasks in the top cpuset. Tasks not in the top cpuset can rely on the hotplug code to update the cpu affinity appropriately.
Oh, I mistook this for hotplug changing behavior, but it's actually about updating top_cpuset tasks when one of its children becomes a partition root.
IIUC, top cpuset + hotplug has been treated specially because hotplug must have taken care of affinity regardless of cpuset. v1 allowed modification of the top cpuset's mask (not sure if that worked); v2 won't allow direct modification of the top cpuset's mask, only indirect modification via partition root children.
So this is a continuation of 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks") to ensure a hotplug offline/online cycle won't overwrite top_cpuset tasks' affinities (when partitions change during the offlined period). This looks correct in this regard then. (I wish it were simpler but that's for a different/broader top cpuset+hotplug approach.)
You can't change the content of "cpuset.cpus" in the top cpuset (both v1 & v2). I believe the CPU hotplug code does not attempt to update the cpumask of tasks in the top cpuset mainly due to potential locking issues, as hotplug is triggered by a hardware event. A partition change, however, is a userspace event. So there shouldn't be any locking implication other than the fact that per-cpu kthreads should not have their cpumasks changed.
To be consistent with commit 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks"), similar logic needs to be applied in the latter case.
Cheers, Longman
On a system with asymmetric CPUs, a restricted task is one that can run on only a selected subset of the available CPUs. When a CPU goes offline or when "cpuset.cpus" is changed, it is possible that a restricted task may not have any runnable CPUs left in the current cpuset even if there are still some CPUs in effective_cpus. In this case, the restricted task cannot be run at all.
There are several ways we may be able to handle this situation. Treating it like empty effective_cpus is probably too disruptive and is unfair to the normal tasks. So it is better to have some special handling for these restricted tasks. One possibility is to move the restricted tasks up the cpuset hierarchy, but it is tricky to do it right. Another solution is to assign other usable CPUs to these tasks. This patch implements the latter alternative by finding one usable CPU by walking up the cpuset hierarchy and printing an informational message to let the users know that these restricted tasks are running in a cpuset with no usable CPU.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 56 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 55 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bbf57dcb2f68..aa8225daf1d3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1202,6 +1202,38 @@ void rebuild_sched_domains(void)
 	cpus_read_unlock();
 }
 
+/*
+ * Find a usable effective (online) CPU up the cpuset hierarchy and return it.
+ */
+static int find_usable_cpu(struct cpuset *cs, struct cpumask *new_cpus,
+			   const struct cpumask *possible_mask)
+{
+	struct cpuset *parent;
+	unsigned long flags;
+	int cpu;
+
+	/*
+	 * When offlining cpu, some effective_cpus may not be up to date.
+	 * So check cpu_online_mask to be sure.
+	 */
+	parent = parent_cs(cs);
+	while (parent &&
+	       (!cpumask_and(new_cpus, parent->effective_cpus, possible_mask) ||
+		!cpumask_and(new_cpus, new_cpus, cpu_online_mask)))
+		parent = parent_cs(parent);
+
+	/* Fall back to all possible online cpus, if necessary */
+	if (!parent)
+		cpumask_and(new_cpus, possible_mask, cpu_online_mask);
+
+	/* cpumask_any_distribute() has to be called with preemption disabled */
+	local_irq_save(flags);
+	cpu = cpumask_any_distribute(new_cpus);
+	local_irq_restore(flags);
+
+	return cpu;
+}
+
 /**
  * update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
  * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
@@ -1218,6 +1250,7 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
 	struct task_struct *task;
 	bool top_cs = cs == &top_cpuset;
 
+	percpu_rwsem_assert_held(&cpuset_rwsem);
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it))) {
 		const struct cpumask *possible_mask = task_cpu_possible_mask(task);
@@ -1232,7 +1265,28 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
 		} else {
 			cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
 		}
-		set_cpus_allowed_ptr(task, new_cpus);
+		/*
+		 * On systems with asymmetric CPUs, it is possible that
+		 * cpumask will become empty or set_cpus_allowed_ptr() will
+		 * return an error even if we still have CPUs in
+		 * effective_cpus. In this case, we find a usable CPU walking
+		 * up the cpuset hierarchy and use that for this particular
+		 * task with an informational message about the change in the
+		 * hope that the users will adjust "cpuset.cpus" accordingly.
+		 */
+		if (cpumask_empty(new_cpus) ||
+		    set_cpus_allowed_ptr(task, new_cpus)) {
+			char name[80];
+			int cpu;
+
+			cpu = find_usable_cpu(cs, new_cpus, possible_mask);
+			cpumask_clear(new_cpus);
+			cpumask_set_cpu(cpu, new_cpus);
+			WARN_ON_ONCE(set_cpus_allowed_ptr(task, new_cpus));
+			cgroup_name(cs->css.cgroup, name, sizeof(name));
+			pr_info("cpuset: Restricted task %s(%d) in cpuset %s is forced to run on outside CPU %d\n",
+				task->comm, task->pid, name, cpu);
+		}
 	}
 	css_task_iter_end(&it);
 }
Hello.
On Mon, Mar 06, 2023 at 03:08:47PM -0500, Waiman Long longman@redhat.com wrote:
On a system with asymmetric CPUs, a restricted task is one that can run only a selected subset of available CPUs. When a CPU goes offline or when "cpuset.cpus" is changed, it is possible that a restricted task may not have any runnable CPUs left in the current cpuset even if there is still some CPUs in effective_cpus. In this case, the restricted task cannot be run at all.
There are several ways we may be able to handle this situation. Treating it like empty effective_cpus is probably too disruptive and is unfair to the normal tasks. So it is better to have some special handling for these restricted tasks. One possibility is to move the restricted tasks up the cpuset hierarchy, but it is tricky to do it right. Another solution is to assign other usable CPUs to these tasks. This patch implements the later alternative by finding one usable CPU by walking up the cpuset hierarchy and printing an informational message to let the users know that these restricted tasks are running in a cpuset with no usable CPU.
Signed-off-by: Waiman Long longman@redhat.com
kernel/cgroup/cpuset.c | 56 +++++++++++++++++++++++++++++++++++++++++- 1 file changed, 55 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index bbf57dcb2f68..aa8225daf1d3 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -1202,6 +1202,38 @@ void rebuild_sched_domains(void) cpus_read_unlock(); } [...] /**
- update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
- @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
@@ -1218,6 +1250,7 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus) struct task_struct *task; bool top_cs = cs == &top_cpuset;
- percpu_rwsem_assert_held(&cpuset_rwsem); css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) { const struct cpumask *possible_mask = task_cpu_possible_mask(task);
@@ -1232,7 +1265,28 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus) } else { cpumask_and(new_cpus, cs->effective_cpus, possible_mask); }
set_cpus_allowed_ptr(task, new_cpus);
/*
* On systems with asymmetric CPUs, it is possible that
* cpumask will become empty or set_cpus_allowed_ptr() will
* return an error even if we still have CPUs in
* effective_cpus. In this case, we find a usable CPU walking
* up the cpuset hierarchy and use that for this particular
* task with an informational message about the change in the
* hope that the users will adjust "cpuset.cpus" accordingly.
*/
if (cpumask_empty(new_cpus) ||
set_cpus_allowed_ptr(task, new_cpus)) {
IIUC, cpumask_empty(new_cpus) here implies cpumask_empty(cs->effective_cpus) but that shouldn't happen (cs should inherit non-empty mask from an ancestor). Do I miss/forget anything?
This thus covers the case when p->user_cpus_ptr is incompatible with hotplug or the cpuset.cpus allowance and a different affinity must be chosen. But doesn't that mean that the task would run _out_ of cs->effective_cpus? I guess that's unavoidable on asymmetric CPU archs but not on SMPs. Shouldn't the solution distinguish between the two? (I.e. never run out of effective_cpus on SMP.)
Thanks, Michal
On 3/14/23 14:17, Michal Koutný wrote:
Hello.
On Mon, Mar 06, 2023 at 03:08:47PM -0500, Waiman Long longman@redhat.com wrote:
On a system with asymmetric CPUs, a restricted task is one that can run only a selected subset of available CPUs. When a CPU goes offline or when "cpuset.cpus" is changed, it is possible that a restricted task may not have any runnable CPUs left in the current cpuset even if there is still some CPUs in effective_cpus. In this case, the restricted task cannot be run at all.
There are several ways we may be able to handle this situation. Treating it like empty effective_cpus is probably too disruptive and is unfair to the normal tasks. So it is better to have some special handling for these restricted tasks. One possibility is to move the restricted tasks up the cpuset hierarchy, but it is tricky to do it right. Another solution is to assign other usable CPUs to these tasks. This patch implements the later alternative by finding one usable CPU by walking up the cpuset hierarchy and printing an informational message to let the users know that these restricted tasks are running in a cpuset with no usable CPU.
Signed-off-by: Waiman Long longman@redhat.com
kernel/cgroup/cpuset.c | 56 +++++++++++++++++++++++++++++++++++++++++- 1 file changed, 55 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index bbf57dcb2f68..aa8225daf1d3 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -1202,6 +1202,38 @@ void rebuild_sched_domains(void) cpus_read_unlock(); } [...] /**
- update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
- @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
@@ -1218,6 +1250,7 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus) struct task_struct *task; bool top_cs = cs == &top_cpuset;
- percpu_rwsem_assert_held(&cpuset_rwsem); css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) { const struct cpumask *possible_mask = task_cpu_possible_mask(task);
@@ -1232,7 +1265,28 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus) } else { cpumask_and(new_cpus, cs->effective_cpus, possible_mask); }
set_cpus_allowed_ptr(task, new_cpus);
/*
* On systems with asymmetric CPUs, it is possible that
* cpumask will become empty or set_cpus_allowed_ptr() will
* return an error even if we still have CPUs in
* effective_cpus. In this case, we find a usable CPU walking
* up the cpuset hierarchy and use that for this particular
* task with an informational message about the change in the
* hope that the users will adjust "cpuset.cpus" accordingly.
*/
if (cpumask_empty(new_cpus) ||
set_cpus_allowed_ptr(task, new_cpus)) {
IIUC, cpumask_empty(new_cpus) here implies cpumask_empty(cs->effective_cpus) but that shouldn't happen (cs should inherit non-empty mask from an ancestor). Do I miss/forget anything?
This thus covers the case when p->user_cpus_ptr is incompatible with hotplug or cpuset.cpus allowance and a different affinity must be chosen. But doesn't that mean that the task would run _out_ of cs->effective_cpus? I guess that's unavoidable on asymmetric CPU archs but not no SMPs. Shouldn't the solution distinguish between the two? (I.e. never run out of effective_cpus on SMP.)
Some arm64 systems can have asymmetric CPUs where certain tasks are only runnable on a selected subset of CPUs. This information is not captured in the cpuset. As a result, task_cpu_possible_mask() may return a mask that has no overlap with effective_cpus, causing new_cpus to become empty. It took me a while to understand the implication of that. I will elaborate this point a bit more in the patch.
Cheers, Longman
On Tue, Mar 14, 2023 at 04:22:06PM -0400, Waiman Long longman@redhat.com wrote:
Some arm64 systems can have asymmetric CPUs where certain tasks are only runnable on a selected subset of CPUs.
Ah, I'm catching up.
This information is not captured in the cpuset. As a result, task_cpu_possible_mask() may return a mask that have no overlap with effective_cpus causing new_cpus to become empty.
I can see that historically there was an approach of terminating tasks that could not be accommodated: 94f9c00f6460 ("arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores"); the removal of killing had been made possible by df950811f4a8 ("arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system").
That gives two other alternatives to affinity modification: 2) kill such tasks (not unlike OOM upon memory.max reduction), 3) reject cpuset reduction (violates cgroup v2 delegation).
What do you think about 2)?
Michal
On 3/17/23 08:27, Michal Koutný wrote:
On Tue, Mar 14, 2023 at 04:22:06PM -0400, Waiman Long longman@redhat.com wrote:
Some arm64 systems can have asymmetric CPUs where certain tasks are only runnable on a selected subset of CPUs.
Ah, I'm catching up.
This information is not captured in the cpuset. As a result, task_cpu_possible_mask() may return a mask that have no overlap with effective_cpus causing new_cpus to become empty.
I can see that historically, there was an approach of terminating unaccomodable tasks: 94f9c00f6460 ("arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores") the removal of killing had been made possible with df950811f4a8 ("arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system").
That gives two other alternatives to affinity modification: 2) kill such tasks (not unlike OOM upon memory.max reduction), 3) reject cpuset reduction (violates cgroup v2 delegation).
What do you think about 2)?
Yes, killing it is one possible solution.
(3) doesn't work if the affinity change is due to hot CPU removal. So that leaves this patch or (2) as the only alternatives. I would like to hear what Will and Tejun think about it.
I am going to remove this patch from the series for the time being.
Thanks, Longman
On Fri, Mar 17, 2023 at 10:59:26AM -0400, Waiman Long wrote:
On 3/17/23 08:27, Michal Koutný wrote:
On Tue, Mar 14, 2023 at 04:22:06PM -0400, Waiman Long longman@redhat.com wrote:
Some arm64 systems can have asymmetric CPUs where certain tasks are only runnable on a selected subset of CPUs.
Ah, I'm catching up.
This information is not captured in the cpuset. As a result, task_cpu_possible_mask() may return a mask that have no overlap with effective_cpus causing new_cpus to become empty.
I can see that historically, there was an approach of terminating unaccomodable tasks: 94f9c00f6460 ("arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores") the removal of killing had been made possible with df950811f4a8 ("arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system").
That gives two other alternatives to affinity modification: 2) kill such tasks (not unlike OOM upon memory.max reduction), 3) reject cpuset reduction (violates cgroup v2 delegation).
What do you think about 2)?
Yes, killing it is one possible solution.
(3) doesn't work if the affinity change is due to hot cpu removal. So that leaves this patch or (2) as the only alternative. I would like to hear what Will and Tejun thinks about it.
The main constraint from the Android side (the lucky ecosystem where these SoCs tend to show up) is that existing userspace (including 32-bit binaries) continues to function without modification. So approaches such as killing tasks or rejecting system calls tend not to work as well, since you inevitably get divergent behaviour leading to functional breakage rather than e.g. performance anomalies.
Having said that, the behaviour we currently have in mainline seems to be alright, so please don't go out of your way to accommodate these SoCs. I'm mainly just concerned about introducing any regressions, which is why I ran my tests on this series.
Cheers,
Will
On 3/24/23 10:32, Will Deacon wrote:
On Fri, Mar 17, 2023 at 10:59:26AM -0400, Waiman Long wrote:
On 3/17/23 08:27, Michal Koutný wrote:
On Tue, Mar 14, 2023 at 04:22:06PM -0400, Waiman Long longman@redhat.com wrote:
Some arm64 systems can have asymmetric CPUs where certain tasks are only runnable on a selected subset of CPUs.
Ah, I'm catching up.
This information is not captured in the cpuset. As a result, task_cpu_possible_mask() may return a mask that have no overlap with effective_cpus causing new_cpus to become empty.
I can see that historically, there was an approach of terminating unaccomodable tasks: 94f9c00f6460 ("arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores") the removal of killing had been made possible with df950811f4a8 ("arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system").
That gives two other alternatives to affinity modification: 2) kill such tasks (not unlike OOM upon memory.max reduction), 3) reject cpuset reduction (violates cgroup v2 delegation).
What do you think about 2)?
Yes, killing it is one possible solution.
(3) doesn't work if the affinity change is due to hot cpu removal. So that leaves this patch or (2) as the only alternative. I would like to hear what Will and Tejun thinks about it.
The main constraint from the Android side (the lucky ecosystem where these SoCs tend to show up) is that existing userspace (including 32-bit binaries) continues to function without modification. So approaches such as killing tasks or rejecting system calls tend not to work as well, since you inevitably get divergent behaviour leading to functional breakage rather than e.g. performance anomalies.
Having said that, the behaviour we currently have in mainline seems to be alright, so please don't go out of your way to accomodate these SoCs. I'm mainly just concerned about introducing any regressions, which is why I ran my tests on this series
I agree that killing it may be too draconian. I am withholding this patch for now.
Thanks, Longman
On Fri, Mar 24, 2023 at 02:32:50PM +0000, Will Deacon will@kernel.org wrote:
So approaches such as killing tasks or rejecting system calls tend not to work as well, since you inevitably get divergent behaviour leading to functional breakage rather than e.g. performance anomalies.
What about a temporary performance drop from 100% to 0%, aka freezing the tasks for the duration of the mismatching affinity config?
Having said that, the behaviour we currently have in mainline seems to be alright, so please don't go out of your way to accomodate these SoCs.
I see. (Just wondering what you think about the fourth option above.)
Thanks, Michal
On 3/24/23 14:19, Michal Koutný wrote:
On Fri, Mar 24, 2023 at 02:32:50PM +0000, Will Deacon will@kernel.org wrote:
So approaches such as killing tasks or rejecting system calls tend not to work as well, since you inevitably get divergent behaviour leading to functional breakage rather than e.g. performance anomalies.
What about temporary performance drop from 100% to 0% aka freezing the tasks for the duration of the mismatching affinity config?
Freezing the tasks can be a lot of extra work. I would prefer something simpler.
Without this patch, I believe it will lead to a cpumask of 0 which will cause the scheduler to pick a fallback cpu. It looks like the fallback code may be able to pick up the right cpu or it may panic the system (less likely).
Cheers, Longman
Having said that, the behaviour we currently have in mainline seems to be alright, so please don't go out of your way to accomodate these SoCs.
I see. (Just wondering what you think about the fourth option above.)
Thanks, Michal
Since commit 431c69fac05b ("cpuset: Honour task_cpu_possible_mask() in guarantee_online_cpus()"), task_cpu_possible_mask() is used within the cpuset code. However, it is hard to find an arm64 system that can actually make task_cpu_possible_mask() return a different cpu mask. As a result, it is hard to exercise the correctness of the code that handles corner cases involving task_cpu_possible_mask().
To help in exercising those code paths, we need a way to force task_cpu_possible_mask() to return a different cpu mask. This patch adds a new CONFIG_DEBUG_CPUSETS config option to enable some debug code to do just that. The idea is to create a debugfs file "debug_cpu_possible_mask" that holds the cpumask to be returned by task_cpu_possible_mask() when a task whose name starts with the special prefix "cstest" is used as the input argument. Userspace testing code is then able to exercise the different code paths that are affected by task_cpu_possible_mask().
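A rough usage sketch (the cgroup path, copied binary name, and CPU numbers are illustrative; it assumes cgroup2 at /sys/fs/cgroup and debugfs at /sys/kernel/debug):

	# Pretend tasks named "cstest*" can only run on CPUs 0-1
	echo 0-1 > /sys/kernel/debug/debug_cpu_possible_mask

	# Create a test cpuset whose cpus don't intersect the debug mask
	echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
	mkdir -p /sys/fs/cgroup/cstest
	echo 2-3 > /sys/fs/cgroup/cstest/cpuset.cpus

	# comm must start with "cstest", so run a renamed copy of sleep
	cp "$(command -v sleep)" /tmp/cstest_sleep
	/tmp/cstest_sleep 60 &
	echo $! > /sys/fs/cgroup/cstest/cgroup.procs

	taskset -cp $!		# observe the resulting cpu affinity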
Signed-off-by: Waiman Long <longman@redhat.com>
---
 init/Kconfig           |  5 +++
 kernel/cgroup/cpuset.c | 76 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)
diff --git a/init/Kconfig b/init/Kconfig
index 18f0bf50c468..2abaa830aff0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1140,6 +1140,11 @@ config PROC_PID_CPUSET
 	depends on CPUSETS
 	default y
 
+config DEBUG_CPUSETS
+	bool "Enable cpuset debugging"
+	depends on CPUSETS && DEBUG_FS
+	default n
+
 config CGROUP_DEVICE
 	bool "Device controller"
 	help
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index aa8225daf1d3..45051ebb6606 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -220,6 +220,29 @@ static inline bool is_prs_invalid(int prs_state)
 	return prs_state < 0;
 }
 
+#ifdef CONFIG_DEBUG_CPUSETS
+static struct cpumask debug_cpu_possible_mask;
+
+/*
+ * Debugging code for testing code involving task_cpu_possible_mask()
+ */
+static inline const struct cpumask *
+__task_cpu_possible_mask(struct task_struct *p)
+{
+	const struct cpumask *mask = task_cpu_possible_mask(p);
+
+	if (mask != cpu_possible_mask)
+		return mask;
+	else if (!strncmp(p->comm, "cstest", 6))
+		return &debug_cpu_possible_mask;
+	else
+		return cpu_possible_mask;
+}
+
+#undef task_cpu_possible_mask
+#define task_cpu_possible_mask(p)	__task_cpu_possible_mask(p)
+#endif /* CONFIG_DEBUG_CPUSETS */
+
 /*
  * Temporary cpumasks for working with partitions that are passed among
  * functions to avoid memory allocation in inner functions.
@@ -4139,3 +4162,56 @@ void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task)
 	seq_printf(m, "Mems_allowed_list:\t%*pbl\n",
 		   nodemask_pr_args(&task->mems_allowed));
 }
+
+#ifdef CONFIG_DEBUG_CPUSETS
+#include <linux/debugfs.h>
+
+/*
+ * Add a debugfs file "debug_cpu_possible_mask" that allows user to set
+ * a debug mask for testing.
+ */
+static ssize_t read_debug_mask(struct file *file, char __user *user_buf,
+			       size_t count, loff_t *ppos)
+{
+	char buf[80];
+	int len;
+
+	len = snprintf(buf, sizeof(buf) - 1, "%*pbl\n",
+		       cpumask_pr_args(&debug_cpu_possible_mask));
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t write_debug_mask(struct file *file, const char __user *user_buf,
+				size_t count, loff_t *ppos)
+{
+	unsigned int len;
+	char buf[80];
+	int retval = 0;
+
+	len = min(count, sizeof(buf) - 1);
+	if (copy_from_user(buf, user_buf, len))
+		return -EFAULT;
+
+	if (!*buf)
+		cpumask_clear(&debug_cpu_possible_mask);
+	else
+		retval = cpulist_parse(buf, &debug_cpu_possible_mask);
+
+	return (retval < 0) ? retval : count;
+}
+
+static const struct file_operations fops_debug_mask = {
+	.read	= read_debug_mask,
+	.write	= write_debug_mask,
+	.llseek	= default_llseek,
+};
+
+static int __init create_debug_cpu_possible_mask(void)
+{
+	cpumask_copy(&debug_cpu_possible_mask, cpu_possible_mask);
+	debugfs_create_file("debug_cpu_possible_mask", 0600, NULL, NULL,
+			    &fops_debug_mask);
+	return 0;
+}
+late_initcall(create_debug_cpu_possible_mask);
+#endif /* CONFIG_DEBUG_CPUSETS */
This patch makes the following minor updates to the cpuset partition testing script test_cpuset_prs.sh.
- Remove online_cpus function call as it will be called anyway on exit in cleanup.

- Make the enabling of sched/verbose debugfs flag conditional on the "-v" verbose option and set DELAY_FACTOR to 2 in this case as cpuset partition operations are likely to be slowed down by enabling that.
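For reference, a rough invocation of the script with the new behavior (path relative to the kernel tree; the script must run as root):

	cd tools/testing/selftests/cgroup
	sudo ./test_cpuset_prs.sh -v	# enables sched/verbose and bumps DELAY_FACTOR to 2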
Signed-off-by: Waiman Long <longman@redhat.com>
---
 .../selftests/cgroup/test_cpuset_prs.sh | 25 +++++++++++--------
 1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 75c100de90ff..2b5215cc599f 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -15,13 +15,6 @@ skip_test() {
 
 [[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
 
-# Set sched verbose flag, if available
-if [[ -d /sys/kernel/debug/sched ]]
-then
-	# Used to restore the original setting during cleanup
-	SCHED_DEBUG=$(cat /sys/kernel/debug/sched/verbose)
-	echo Y > /sys/kernel/debug/sched/verbose
-fi
-
 # Get wait_inotify location
 WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
 
@@ -37,10 +30,14 @@ CPUS=$(lscpu | grep "^CPU(s):" | sed -e "s/.*:[[:space:]]*//")
 PROG=$1
 VERBOSE=
 DELAY_FACTOR=1
+SCHED_DEBUG=
 while [[ "$1" = -* ]]
 do
 	case "$1" in
 		-v) VERBOSE=1
+		    # Enable sched/verbose can slow thing down
+		    [[ $DELAY_FACTOR -eq 1 ]] &&
+			DELAY_FACTOR=2
 		    break
 		    ;;
 		-d) DELAY_FACTOR=$2
@@ -54,6 +51,14 @@ do
 	shift
 done
 
+# Set sched verbose flag if available when "-v" option is specified
+if [[ -n "$VERBOSE" && -d /sys/kernel/debug/sched ]]
+then
+	# Used to restore the original setting during cleanup
+	SCHED_DEBUG=$(cat /sys/kernel/debug/sched/verbose)
+	echo Y > /sys/kernel/debug/sched/verbose
+fi
+
 cd $CGROUP2
 echo +cpuset > cgroup.subtree_control
 [[ -d test ]] || mkdir test
@@ -65,7 +70,8 @@ cleanup()
 	rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
 	cd ..
 	rmdir test > /dev/null 2>&1
-	echo "$SCHED_DEBUG" > /sys/kernel/debug/sched/verbose
+	[[ -n "$SCHED_DEBUG" ]] &&
+		echo "$SCHED_DEBUG" > /sys/kernel/debug/sched/verbose
 }
 
 # Pause in ms
@@ -571,7 +577,6 @@ run_state_test()
 		echo "Test $TEST[$I] failed result check!"
 		eval echo "${$TEST[$I]}"
 		dump_states
-		online_cpus
 		exit 1
 	}
 
@@ -582,7 +587,6 @@ run_state_test()
 		eval echo "${$TEST[$I]}"
 		echo
 		dump_states
-		online_cpus
 		exit 1
 	}
 	}
@@ -594,7 +598,6 @@ run_state_test()
 		eval echo "${$TEST[$I]}"
 		echo
 		dump_states
-		online_cpus
 		exit 1
 	}
 	}
On 3/7/23 01:38, Waiman Long wrote:
This patch makes the following minor updates to the cpuset partition testing script test_cpuset_prs.sh.
- Remove online_cpus function call as it will be called anyway on exit in cleanup.
- Make the enabling of sched/verbose debugfs flag conditional on the "-v" verbose option and set DELAY_FACTOR to 2 in this case as cpuset partition operations are likely to be slowed down by enabling that.
Signed-off-by: Waiman Long longman@redhat.com
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Hi Waiman,
On Mon, Mar 06, 2023 at 03:08:44PM -0500, Waiman Long wrote:
This patch series includes miscellaneous update to the cpuset and its testing code.
Patch 2 is actually a follow-up of commit 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks").
Patches 3-4 are for handling corner cases when dealing with task_cpu_possible_mask().
Thanks for cc'ing me on these. I ran my arm64 asymmetric tests and, fwiw, I get the same results as vanilla -rc2, so that's good.
One behaviour that persists (and which I thought might be addressed by this series) is the following. Imagine a 4-CPU system with CPUs 0-1 being 64-bit only. If I configure a parent cpuset with 'cpuset.cpus' of "0-2" and a child cpuset with 'cpuset.cpus' of "0-1", then attaching a 32-bit task to the child cpuset will result in an affinity mask of 4. If I then change 'cpuset.cpus' of the parent cpuset to "0-1,3", the affinity mask of the task remains at '4' whereas it might be nice to update it to '8', in-line with the new affinity mask of the parent cpuset.
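For reference, a rough sequence for that scenario (assuming cgroup2 is mounted at /sys/fs/cgroup; the 32-bit test binary name is illustrative):

	echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
	mkdir -p /sys/fs/cgroup/parent
	echo +cpuset > /sys/fs/cgroup/parent/cgroup.subtree_control
	mkdir -p /sys/fs/cgroup/parent/child
	echo 0-2 > /sys/fs/cgroup/parent/cpuset.cpus
	echo 0-1 > /sys/fs/cgroup/parent/child/cpuset.cpus

	./app32 &					# 32-bit binary (illustrative)
	echo $! > /sys/fs/cgroup/parent/child/cgroup.procs
	taskset -p $!					# -> 4 (CPU 2 only)

	echo 0-1,3 > /sys/fs/cgroup/parent/cpuset.cpus
	taskset -p $!					# still 4, not updated to 8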
Anyway, I'm not complaining (this is certainly _not_ a regression), but I thought I'd highlight it in case you were aiming to address this with your changes.
Cheers,
Will
On 3/15/23 12:24, Will Deacon wrote:
Hi Waiman,
On Mon, Mar 06, 2023 at 03:08:44PM -0500, Waiman Long wrote:
This patch series includes miscellaneous update to the cpuset and its testing code.
Patch 2 is actually a follow-up of commit 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks").
Patches 3-4 are for handling corner cases when dealing with task_cpu_possible_mask().
Thanks for cc'ing me on these. I ran my arm64 asymmetric tests and, fwiw, I get the same results as vanilla -rc2, so that's good.
One behaviour that persists (and which I thought might be addressed by this series) is the following. Imagine a 4-CPU system with CPUs 0-1 being 64-bit only. If I configure a parent cpuset with 'cpuset.cpus' of "0-2" and a child cpuset with 'cpuset.cpus' of "0-1", then attaching a 32-bit task to the child cpuset will result in an affinity mask of 4. If I then change 'cpuset.cpus' of the parent cpuset to "0-1,3", the affinity mask of the task remains at '4' whereas it might be nice to update it to '8', in-line with the new affinity mask of the parent cpuset.
Anyway, I'm not complaining (this is certainly _not_ a regression), but I thought I'd highlight it in case you were aiming to address this with your changes.
I believe it is because changes to the parent cpuset alone won't cause the tasks in the child cpuset to be re-evaluated unless they change the effective_cpus of the child cpuset, which is the case here. We currently don't track how many tasks in the child cpusets are using the parent's cpumask due to lacking runnable CPUs in the child cpuset. We can only fix this if we track those special tasks. It is fixable, but I don't know if it is a problem that is worth fixing.
Cheers, Longman