v2: - Add a new patch 1 that fixes a bug introduced by recent v6.2 commit 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task"). - Make a small twist and additional comment to patch 2 ("cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset") as suggested by Michal. - Remove v1 patches 3/4 for now for further discussion.
This patch series includes miscellaneous update to the cpuset and its testing code.
Patch 1 fixes a bug caused by commit 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") in the partition handling code. This fix was verified by running the test_cpuset_prs.sh test.
Patch 2 is for a hotplug optimization.
Patch 3 is actually a follow-up of commit 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks").
Patch 4 reduces verbosity when running test_cpuset_prs.sh test script unless explicitly enabled with the -v option.
Waiman Long (4): cgroup/cpuset: Fix partition root's cpuset.cpus update bug cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated cgroup/cpuset: Minor updates to test_cpuset_prs.sh
kernel/cgroup/cpuset.c | 38 +++++++++++++------ .../selftests/cgroup/test_cpuset_prs.sh | 25 ++++++------ 2 files changed, 41 insertions(+), 22 deletions(-)
It was found that commit 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") introduced a bug that corrupted "cpuset.cpus" of a partition root when it was updated.
It is because the tmp->new_cpus field of the passed tmp parameter of update_parent_subparts_cpumask() should not be used at all as it contains important cpumask data that should not be overwritten. Fix it by using tmp->addmask instead.
Also update update_cpumask() to make sure that trialcs->cpu_allowed will not be corrupted until it is no longer needed.
Fixes: 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") Signed-off-by: Waiman Long longman@redhat.com --- kernel/cgroup/cpuset.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 636f1c682ac0..f310915d1257 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -1513,7 +1513,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd, spin_unlock_irq(&callback_lock);
if (adding || deleting) - update_tasks_cpumask(parent, tmp->new_cpus); + update_tasks_cpumask(parent, tmp->addmask);
/* * Set or clear CS_SCHED_LOAD_BALANCE when partcmd_update, if necessary. @@ -1770,10 +1770,13 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, /* * Use the cpumasks in trialcs for tmpmasks when they are pointers * to allocated cpumasks. + * + * Note that update_parent_subparts_cpumask() uses only addmask & + * delmask, but not new_cpus. */ tmp.addmask = trialcs->subparts_cpus; tmp.delmask = trialcs->effective_cpus; - tmp.new_cpus = trialcs->cpus_allowed; + tmp.new_cpus = NULL; #endif
retval = validate_change(cs, trialcs); @@ -1838,6 +1841,11 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, } spin_unlock_irq(&callback_lock);
+#ifdef CONFIG_CPUMASK_OFFSTACK + /* Now trialcs->cpus_allowed is available */ + tmp.new_cpus = trialcs->cpus_allowed; +#endif + /* effective_cpus will be updated here */ update_cpumasks_hier(cs, &tmp, false);
On Fri, Mar 17, 2023 at 11:15:05AM -0400, Waiman Long wrote:
It was found that commit 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") introduced a bug that corrupted "cpuset.cpus" of a partition root when it was updated.
It is because the tmp->new_cpus field of the passed tmp parameter of update_parent_subparts_cpumask() should not be used at all as it contains important cpumask data that should not be overwritten. Fix it by using tmp->addmask instead.
Also update update_cpumask() to make sure that trialcs->cpu_allowed will not be corrupted until it is no longer needed.
Fixes: 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") Signed-off-by: Waiman Long longman@redhat.com
Applied to cgroup/for-6.3-fixes w/ stable cc'd.
Thanks.
If a hotplug event doesn't affect the current cpuset, there is no point to call hotplug_update_tasks() or hotplug_update_tasks_legacy(). So just skip it.
Signed-off-by: Waiman Long longman@redhat.com Reviewed-by: Michal Koutný mkoutny@suse.com --- kernel/cgroup/cpuset.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index f310915d1257..5b8d763555b0 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -3516,6 +3516,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) update_tasks: cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus); mems_updated = !nodes_equal(new_mems, cs->effective_mems); + if (!cpus_updated && !mems_updated) + goto unlock; /* Hotplug doesn't affect this cpuset */
if (mems_updated) check_insane_mems_config(&new_mems); @@ -3527,6 +3529,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems, cpus_updated, mems_updated);
+unlock: percpu_up_write(&cpuset_rwsem); }
Similar to commit 3fb906e7fabb ("group/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks"), the whole set of possible CPUs including offline ones should be used for setting cpumasks for tasks in the top cpuset when a cpuset partition is modified as the hotplug code won't update cpumasks for tasks in the top cpuset when CPUs become online or offline.
Signed-off-by: Waiman Long longman@redhat.com --- kernel/cgroup/cpuset.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 5b8d763555b0..db8793a2082f 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -1209,7 +1209,9 @@ void rebuild_sched_domains(void) * * Iterate through each task of @cs updating its cpus_allowed to the * effective cpuset's. As this function is called with cpuset_rwsem held, - * cpuset membership stays stable. + * cpuset membership stays stable. For top_cpuset, task_cpu_possible_mask() + * is used instead of effective_cpus to make sure all offline CPUs are also + * included as hotplug code won't update cpumasks for tasks in top_cpuset. */ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus) { @@ -1219,15 +1221,18 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
css_task_iter_start(&cs->css, 0, &it); while ((task = css_task_iter_next(&it))) { - /* - * Percpu kthreads in top_cpuset are ignored - */ - if (top_cs && (task->flags & PF_KTHREAD) && - kthread_is_per_cpu(task)) - continue; + const struct cpumask *possible_mask = task_cpu_possible_mask(task);
- cpumask_and(new_cpus, cs->effective_cpus, - task_cpu_possible_mask(task)); + if (top_cs) { + /* + * Percpu kthreads in top_cpuset are ignored + */ + if ((task->flags & PF_KTHREAD) && kthread_is_per_cpu(task)) + continue; + cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus); + } else { + cpumask_and(new_cpus, possible_mask, cs->effective_cpus); + } set_cpus_allowed_ptr(task, new_cpus); } css_task_iter_end(&it);
Hello.
On Fri, Mar 17, 2023 at 11:15:07AM -0400, Waiman Long longman@redhat.com wrote:
- Iterate through each task of @cs updating its cpus_allowed to the
- effective cpuset's. As this function is called with cpuset_rwsem held,
- cpuset membership stays stable.
- cpuset membership stays stable. For top_cpuset, task_cpu_possible_mask()
- is used instead of effective_cpus to make sure all offline CPUs are also
*/
- included as hotplug code won't update cpumasks for tasks in top_cpuset.
On Wed, Mar 15, 2023 at 11:06:20AM +0100, Michal Koutný mkoutny@suse.com wrote:
I see now that it returns offlined cpus to top cpuset's tasks.
I considered only the "base" set change cs->effective_cpus -> possible_mask. (Apologies for that mistake.)
However, I now read the note about subparts_cpus
* effective_cpus contains only onlined CPUs, but subparts_cpus * may have offlined ones.
So if subpart_cpus keeps offlined CPUs, they will be subtracted from possible_mask and absent in the resulting new_cpus, i.e. undesirable for the tasks in that cpuset :-/
Michal
On 3/17/23 14:01, Michal Koutný wrote:
Hello.
On Fri, Mar 17, 2023 at 11:15:07AM -0400, Waiman Long longman@redhat.com wrote:
- Iterate through each task of @cs updating its cpus_allowed to the
- effective cpuset's. As this function is called with cpuset_rwsem held,
- cpuset membership stays stable.
- cpuset membership stays stable. For top_cpuset, task_cpu_possible_mask()
- is used instead of effective_cpus to make sure all offline CPUs are also
*/
- included as hotplug code won't update cpumasks for tasks in top_cpuset.
On Wed, Mar 15, 2023 at 11:06:20AM +0100, Michal Koutný mkoutny@suse.com wrote:
I see now that it returns offlined cpus to top cpuset's tasks.
I considered only the "base" set change cs->effective_cpus -> possible_mask. (Apologies for that mistake.)
However, I now read the note about subparts_cpus
* effective_cpus contains only onlined CPUs, but subparts_cpus * may have offlined ones.
So if subpart_cpus keeps offlined CPUs, they will be subtracted from possible_mask and absent in the resulting new_cpus, i.e. undesirable for the tasks in that cpuset :-/
A cpu will be in the subparts_cpus only if it has been given to the child partition. So when it becomes online, it will become part of the scheduling domain that child partition. Only the tasks in that child partition will get their cpumasks updated to use it, not those in the top cpuset.
Cheers, Longman
On Fri, Mar 17, 2023 at 02:05:32PM -0400, Waiman Long longman@redhat.com wrote:
A cpu will be in the subparts_cpus only if it has been given to the child partition. So when it becomes online, it will become part of the scheduling domain that child partition. Only the tasks in that child partition will get their cpumasks updated to use it, not those in the top cpuset.
Right, it's actually the difference between offlining a CPU and giving it to a sub-partition (hence a removed child (or switched to member) before CPU onlining). It's clear to me now.
Thanks, Michal
This patch makes the following minor updates to the cpuset partition testing script test_cpuset_prs.sh.
- Remove online_cpus function call as it will be called anyway on exit in cleanup. - Make the enabling of sched/verbose debugfs flag conditional on the "-v" verbose option and set DELAY_FACTOR to 2 in this case as cpuset partition operations are likely to be slowed down by enabling that.
Signed-off-by: Waiman Long longman@redhat.com --- .../selftests/cgroup/test_cpuset_prs.sh | 25 +++++++++++-------- 1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh index 75c100de90ff..2b5215cc599f 100755 --- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh @@ -15,13 +15,6 @@ skip_test() {
[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
-# Set sched verbose flag, if available -if [[ -d /sys/kernel/debug/sched ]] -then - # Used to restore the original setting during cleanup - SCHED_DEBUG=$(cat /sys/kernel/debug/sched/verbose) - echo Y > /sys/kernel/debug/sched/verbose -fi
# Get wait_inotify location WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify @@ -37,10 +30,14 @@ CPUS=$(lscpu | grep "^CPU(s):" | sed -e "s/.*:[[:space:]]*//") PROG=$1 VERBOSE= DELAY_FACTOR=1 +SCHED_DEBUG= while [[ "$1" = -* ]] do case "$1" in -v) VERBOSE=1 + # Enable sched/verbose can slow thing down + [[ $DELAY_FACTOR -eq 1 ]] && + DELAY_FACTOR=2 break ;; -d) DELAY_FACTOR=$2 @@ -54,6 +51,14 @@ do shift done
+# Set sched verbose flag if available when "-v" option is specified +if [[ -n "$VERBOSE" && -d /sys/kernel/debug/sched ]] +then + # Used to restore the original setting during cleanup + SCHED_DEBUG=$(cat /sys/kernel/debug/sched/verbose) + echo Y > /sys/kernel/debug/sched/verbose +fi + cd $CGROUP2 echo +cpuset > cgroup.subtree_control [[ -d test ]] || mkdir test @@ -65,7 +70,8 @@ cleanup() rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1 cd .. rmdir test > /dev/null 2>&1 - echo "$SCHED_DEBUG" > /sys/kernel/debug/sched/verbose + [[ -n "$SCHED_DEBUG" ]] && + echo "$SCHED_DEBUG" > /sys/kernel/debug/sched/verbose }
# Pause in ms @@ -571,7 +577,6 @@ run_state_test() echo "Test $TEST[$I] failed result check!" eval echo "${$TEST[$I]}" dump_states - online_cpus exit 1 }
@@ -582,7 +587,6 @@ run_state_test() eval echo "${$TEST[$I]}" echo dump_states - online_cpus exit 1 } } @@ -594,7 +598,6 @@ run_state_test() eval echo "${$TEST[$I]}" echo dump_states - online_cpus exit 1 } }
On Fri, Mar 17, 2023 at 11:15:08AM -0400, Waiman Long wrote:
This patch makes the following minor updates to the cpuset partition testing script test_cpuset_prs.sh.
- Remove online_cpus function call as it will be called anyway on exit in cleanup.
- Make the enabling of sched/verbose debugfs flag conditional on the "-v" verbose option and set DELAY_FACTOR to 2 in this case as cpuset partition operations are likely to be slowed down by enabling that.
Signed-off-by: Waiman Long longman@redhat.com
Applied 2-4 to cgroup/for-6.4.
Thanks.
Hi Waiman,
On Fri, Mar 17, 2023 at 11:15:04AM -0400, Waiman Long wrote:
v2:
- Add a new patch 1 that fixes a bug introduced by recent v6.2 commit 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task").
- Make a small twist and additional comment to patch 2 ("cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset") as suggested by Michal.
- Remove v1 patches 3/4 for now for further discussion.
This patch series includes miscellaneous update to the cpuset and its testing code.
FWIW, this series also passes my asymmetric 32-bit tests.
Cheers,
Will
On 3/28/23 09:40, Will Deacon wrote:
Hi Waiman,
On Fri, Mar 17, 2023 at 11:15:04AM -0400, Waiman Long wrote:
v2:
- Add a new patch 1 that fixes a bug introduced by recent v6.2 commit 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task").
- Make a small twist and additional comment to patch 2 ("cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset") as suggested by Michal.
- Remove v1 patches 3/4 for now for further discussion.
This patch series includes miscellaneous update to the cpuset and its testing code.
FWIW, this series also passes my asymmetric 32-bit tests.
Thanks Will!
Tejun, do you have time to take a look at this series, especially the first patch which is a fix that may need to go to stable?
Cheers, Longman
linux-kselftest-mirror@lists.linaro.org