v5: - Rebased to the latest for-5.15 branch of cgroup git tree and drop the 1st v4 patch as it has been merged. - Update patch 1 to always allow changing partition root back to member even if it invalidates child partitions undeneath it. - Adjust the empty effective cpu partition patch to not allow 0 effective cpu for terminal partition which will make it invalid). - Add a new patch to enable reading of cpuset.cpus.partition to display the reason that causes invalid partition. - Adjust the documentation and testing patch accordingly.
v4: - Rebased to the for-5.15 branch of cgroup git tree and dropped the first 3 patches of v3 series which have been merged. - Beside prohibiting violation of cpu exclusivity rule, allow arbitrary changes to cpuset.cpus of a partition root and force the partition root to become invalid in case any of the partition root constraints are violated. The documentation file and self test are modified accordingly.
This patchset makes four enhancements to the cpuset v2 code.
Patch 1: Properly handle partition root tree and make partition invalid in case changes to cpuset.cpus violate any of the partition root constraints.
Patch 2: Enable the "cpuset.cpus.partition" file to show the reason that causes invalid partition like "root invalid (No cpu available due to hotplug)".
Patch 3: Add a new partition state "isolated" to create a partition root without load balancing. This is for handling intermitten workloads that have a strict low latency requirement.
Patch 4: Allow partition roots that are not the top cpuset to distribute all its cpus to child partitions as long as there is no task associated with that partition root. This allows more flexibility for middleware to manage multiple partitions.
Patch 5 updates the cgroup-v2.rst file accordingly. Patch 6 adds a new cpuset test to test the new cpuset partition code.
Waiman Long (6): cgroup/cpuset: Properly transition to invalid partition cgroup/cpuset: Show invalid partition reason string cgroup/cpuset: Add a new isolated cpus.partition type cgroup/cpuset: Allow non-top parent partition to distribute out all CPUs cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst kselftest/cgroup: Add cpuset v2 partition root state test
Documentation/admin-guide/cgroup-v2.rst | 116 +-- kernel/cgroup/cpuset.c | 347 ++++++--- tools/testing/selftests/cgroup/Makefile | 5 +- .../selftests/cgroup/test_cpuset_prs.sh | 663 ++++++++++++++++++ tools/testing/selftests/cgroup/wait_inotify.c | 86 +++ 5 files changed, 1068 insertions(+), 149 deletions(-) create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c
For cpuset partition, the special state of PRS_ERROR (invalid partition root) was originally designed to handle hotplug events. In this state, CPUs allocated to the partition root is released back to the parent but the cpuset exclusive flags remain unchanged.
Changing a cpuset into a partition root is strictly controlled. The following constraints must be satisfied in order to make the transition possible:
- The "cpuset.cpus" is not empty and the list of CPUs are exclusive, i.e. they are not shared by any of its siblings. - The parent cgroup is a partition root. - The "cpuset.cpus" is a subset of the parent's "cpuset.cpus.effective". - There is no child cgroups with cpuset enabled.
Changing a partition root back to a member is always allowed, though care must be taken to make sure that this change won't break child cpusets, if present.
Since partition root sets the CPU_EXCLUSIVE flag, cpuset.cpus changes that break the cpu exclusivity rule will not be allowed. However, other changes to cpuset.cpus on a partition root may still cause it to become invalid. So users must always check the partition root state of "cpuset.cpus.partition" after making changes to cpuset.cpus to make sure that the partition root is still valid.
For a partition root tree with parent and child partition roots, there are two cases where the child partitions can become invalid. Firstly, changing partition state to "member" will force the child partitions to become invalid.
Secondly, if some cpus are taken away from the parent partition root so that its cpuset.cpus.effective becomes empty, it will try to pull cpus away from the child partitions and force them to become invalid which may allow the parent partition to remain valid.
This patch makes sure that partitions are properly changed to invalid when some of the valid partition constraints are violated.
Signed-off-by: Waiman Long longman@redhat.com --- kernel/cgroup/cpuset.c | 177 +++++++++++++++++++++++------------------ 1 file changed, 100 insertions(+), 77 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 44d234b0df5e..7705950ad70b 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -1177,10 +1177,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, return -EINVAL;
/* - * Enabling/disabling partition root is not allowed if there are - * online children. + * Enabling partition root is not allowed if there are online children. */ - if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css)) + if ((cmd == partcmd_enable) && css_has_online_children(&cpuset->css)) return -EBUSY;
/* @@ -1208,6 +1207,14 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, /* * partcmd_update with newmask: * + * Make partition invalid if newmask isn't a subset of + * (cpus_allowed | parent->effective_cpus). + */ + cpumask_or(tmp->addmask, cpuset->cpus_allowed, + parent->effective_cpus); + part_error = !cpumask_subset(newmask, tmp->addmask); + + /* * delmask = cpus_allowed & ~newmask & parent->subparts_cpus * addmask = newmask & parent->effective_cpus * & ~parent->subparts_cpus @@ -1220,20 +1227,21 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, adding = cpumask_andnot(tmp->addmask, tmp->addmask, parent->subparts_cpus); /* - * Return error if the new effective_cpus could become empty. + * Make partition invalid if parent's effective_cpus could + * become empty. */ if (adding && cpumask_equal(parent->effective_cpus, tmp->addmask)) { if (!deleting) - return -EINVAL; + part_error = true; /* * As some of the CPUs in subparts_cpus might have * been offlined, we need to compute the real delmask * to confirm that. */ - if (!cpumask_and(tmp->addmask, tmp->delmask, - cpu_active_mask)) - return -EINVAL; + else if (!cpumask_and(tmp->addmask, tmp->delmask, + cpu_active_mask)) + part_error = true; cpumask_copy(tmp->addmask, parent->effective_cpus); } } else { @@ -1242,19 +1250,23 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, * * addmask = cpus_allowed & parent->effective_cpus * + * This gets invoked either due to a hotplug event or + * from update_cpumasks_hier() where we can't return an + * error. This can cause a partition root to become invalid + * in the case of a hotplug. + * * Note that parent's subparts_cpus may have been * pre-shrunk in case there is a change in the cpu list. * So no deletion is needed. */ adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed, parent->effective_cpus); - part_error = cpumask_equal(tmp->addmask, - parent->effective_cpus); + part_error = (is_partition_root(cpuset) && + !parent->nr_subparts_cpus) || + cpumask_equal(tmp->addmask, parent->effective_cpus); }
if (cmd == partcmd_update) { - int prev_prs = cpuset->partition_root_state; - /* * Check for possible transition between PRS_ENABLED * and PRS_ERROR. @@ -1269,13 +1281,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, new_prs = PRS_ENABLED; break; } - /* - * Set part_error if previously in invalid state. - */ - part_error = (prev_prs == PRS_ERROR); }
- if (!part_error && (new_prs == PRS_ERROR)) + if ((old_prs == PRS_ERROR) && (new_prs == PRS_ERROR)) return 0; /* Nothing need to be done */
if (new_prs == PRS_ERROR) { @@ -1407,6 +1415,11 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) case PRS_ENABLED: if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp)) update_tasks_cpumask(parent); + /* + * The cpuset partition_root_state may be + * changed to PRS_ERROR. Capture it. + */ + new_prs = cp->partition_root_state; break;
case PRS_ERROR: @@ -1424,33 +1437,27 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
spin_lock_irq(&callback_lock);
- cpumask_copy(cp->effective_cpus, tmp->new_cpus); if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) { + /* + * Put all active subparts_cpus back to effective_cpus. + */ + cpumask_or(tmp->new_cpus, tmp->new_cpus, + cp->subparts_cpus); + cpumask_and(tmp->new_cpus, tmp->new_cpus, + cpu_active_mask); cp->nr_subparts_cpus = 0; cpumask_clear(cp->subparts_cpus); - } else if (cp->nr_subparts_cpus) { + } + + cpumask_copy(cp->effective_cpus, tmp->new_cpus); + if (cp->nr_subparts_cpus) { /* * Make sure that effective_cpus & subparts_cpus - * are mutually exclusive. - * - * In the unlikely event that effective_cpus - * becomes empty. we clear cp->nr_subparts_cpus and - * let its child partition roots to compete for - * CPUs again. + * of a partition root are mutually exclusive. */ cpumask_andnot(cp->effective_cpus, cp->effective_cpus, cp->subparts_cpus); - if (cpumask_empty(cp->effective_cpus)) { - cpumask_copy(cp->effective_cpus, tmp->new_cpus); - cpumask_clear(cp->subparts_cpus); - cp->nr_subparts_cpus = 0; - } else if (!cpumask_subset(cp->subparts_cpus, - tmp->new_cpus)) { - cpumask_andnot(cp->subparts_cpus, - cp->subparts_cpus, tmp->new_cpus); - cp->nr_subparts_cpus - = cpumask_weight(cp->subparts_cpus); - } + WARN_ON_ONCE(cpumask_empty(cp->effective_cpus)); }
if (new_prs != old_prs) @@ -1582,8 +1589,8 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs, * Make sure that subparts_cpus is a subset of cpus_allowed. */ if (cs->nr_subparts_cpus) { - cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus, - cs->cpus_allowed); + cpumask_and(cs->subparts_cpus, cs->subparts_cpus, + cs->cpus_allowed); cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus); } spin_unlock_irq(&callback_lock); @@ -2005,19 +2012,33 @@ static int update_prstate(struct cpuset *cs, int new_prs) } } else { /* - * Turning off partition root will clear the - * CS_CPU_EXCLUSIVE bit. + * Switch back to member is always allowed even if it + * causes child partitions to become invalid. */ - if (old_prs == PRS_ERROR) { - update_flag(CS_CPU_EXCLUSIVE, cs, 0); - err = 0; - goto out; - } + err = 0; + update_parent_subparts_cpumask(cs, partcmd_disable, NULL, + &tmpmask); + /* + * If there are child partitions, we have to make them invalid. + */ + if (unlikely(cs->nr_subparts_cpus)) { + struct tmpmasks tmp;
- err = update_parent_subparts_cpumask(cs, partcmd_disable, - NULL, &tmpmask); - if (err) - goto out; + spin_lock_irq(&callback_lock); + cs->nr_subparts_cpus = 0; + cpumask_clear(cs->subparts_cpus); + compute_effective_cpumask(cs->effective_cpus, cs, parent); + spin_unlock_irq(&callback_lock); + + /* + * If alloc_cpumasks() fails, we are running out + * of memory and there isn't much we can do. + */ + if (!alloc_cpumasks(NULL, &tmp)) { + update_cpumasks_hier(cs, &tmp); + free_cpumasks(NULL, &tmp); + } + }
/* Turning off CS_CPU_EXCLUSIVE will not return error */ update_flag(CS_CPU_EXCLUSIVE, cs, 0); @@ -3104,11 +3125,28 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
/* * In the unlikely event that a partition root has empty - * effective_cpus or its parent becomes erroneous, we have to - * transition it to the erroneous state. + * effective_cpus, we will have to force any child partitions, + * if present, to become invalid by setting nr_subparts_cpus to 0 + * without causing itself to become invalid. + */ + if (is_partition_root(cs) && cs->nr_subparts_cpus && + cpumask_empty(&new_cpus)) { + cs->nr_subparts_cpus = 0; + cpumask_clear(cs->subparts_cpus); + compute_effective_cpumask(&new_cpus, cs, parent); + } + + /* + * If empty effective_cpus or zero nr_subparts_cpus or its parent + * becomes erroneous, we have to transition it to the erroneous state. */ if (is_partition_root(cs) && (cpumask_empty(&new_cpus) || - (parent->partition_root_state == PRS_ERROR))) { + (parent->partition_root_state == PRS_ERROR) || + !parent->nr_subparts_cpus)) { + int old_prs; + + update_parent_subparts_cpumask(cs, partcmd_disable, + NULL, tmp); if (cs->nr_subparts_cpus) { spin_lock_irq(&callback_lock); cs->nr_subparts_cpus = 0; @@ -3117,38 +3155,23 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) compute_effective_cpumask(&new_cpus, cs, parent); }
- /* - * If the effective_cpus is empty because the child - * partitions take away all the CPUs, we can keep - * the current partition and let the child partitions - * fight for available CPUs. - */ - if ((parent->partition_root_state == PRS_ERROR) || - cpumask_empty(&new_cpus)) { - int old_prs; - - update_parent_subparts_cpumask(cs, partcmd_disable, - NULL, tmp); - old_prs = cs->partition_root_state; - if (old_prs != PRS_ERROR) { - spin_lock_irq(&callback_lock); - cs->partition_root_state = PRS_ERROR; - spin_unlock_irq(&callback_lock); - notify_partition_change(cs, old_prs, PRS_ERROR); - } + old_prs = cs->partition_root_state; + if (old_prs != PRS_ERROR) { + spin_lock_irq(&callback_lock); + cs->partition_root_state = PRS_ERROR; + spin_unlock_irq(&callback_lock); + notify_partition_change(cs, old_prs, PRS_ERROR); } cpuset_force_rebuild(); }
/* * On the other hand, an erroneous partition root may be transitioned - * back to a regular one or a partition root with no CPU allocated - * from the parent may change to erroneous. + * back to a regular one. */ - if (is_partition_root(parent) && - ((cs->partition_root_state == PRS_ERROR) || - !cpumask_intersects(&new_cpus, parent->subparts_cpus)) && - update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp)) + else if (is_partition_root(parent) && + (cs->partition_root_state == PRS_ERROR) && + update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp)) cpuset_force_rebuild();
update_tasks:
Hi Waiman,
I love your patch! Perhaps something to improve:
[auto build test WARNING on cgroup/for-next] [also build test WARNING on next-20210813] [cannot apply to kselftest/next v5.14-rc5] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Waiman-Long/cgroup-cpuset-Add-new-c... base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next config: ia64-defconfig (attached as .config) compiler: ia64-linux-gcc (GCC) 11.2.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/56ec7dd271c77e3cc92f0df6fd766004a7a0... git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Waiman-Long/cgroup-cpuset-Add-new-cpuset-partition-type-empty-effecitve-cpus/20210815-014333 git checkout 56ec7dd271c77e3cc92f0df6fd766004a7a0aa88 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=ia64
If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot lkp@intel.com
All warnings (new ones prefixed by >>):
kernel/cgroup/cpuset.c: In function 'update_prstate':
kernel/cgroup/cpuset.c:2068:1: warning: the frame size of 3072 bytes is larger than 2048 bytes [-Wframe-larger-than=]
2068 | } | ^
vim +2068 kernel/cgroup/cpuset.c
^1da177e4c3f41 kernel/cpuset.c Linus Torvalds 2005-04-16 1966 ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1967 /* ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1968 * update_prstate - update partititon_root_state ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1969 * cs: the cpuset to update 0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long 2021-07-20 1970 * new_prs: new partition root state ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1971 * ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1972 * Call with cpuset_mutex held. ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1973 */ 0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long 2021-07-20 1974 static int update_prstate(struct cpuset *cs, int new_prs) ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1975 { 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 1976 int err, old_prs = cs->partition_root_state; ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1977 struct cpuset *parent = parent_cs(cs); 0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long 2021-07-20 1978 struct tmpmasks tmpmask; ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1979 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 1980 if (old_prs == new_prs) ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1981 return 0; ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1982 ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1983 /* 3881b86128d0be kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1984 * Cannot force a partial or invalid partition root to a full ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1985 * partition root. ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1986 */ 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 1987 if (new_prs && (old_prs == PRS_ERROR)) ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1988 return -EINVAL; ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1989 0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long 2021-07-20 1990 if (alloc_cpumasks(NULL, &tmpmask)) ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1991 return -ENOMEM; ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1992 ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1993 err = -EINVAL; 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 1994 if (!old_prs) { ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1995 /* ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1996 * Turning on partition root requires setting the ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1997 * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1998 * cannot be NULL. ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 1999 */ ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2000 if (cpumask_empty(cs->cpus_allowed)) ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2001 goto out; ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2002 ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2003 err = update_flag(CS_CPU_EXCLUSIVE, cs, 1); ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2004 if (err) ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2005 goto out; ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2006 ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2007 err = update_parent_subparts_cpumask(cs, partcmd_enable, 0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long 2021-07-20 2008 NULL, &tmpmask); ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2009 if (err) { ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2010 update_flag(CS_CPU_EXCLUSIVE, cs, 0); ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2011 goto out; ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2012 } ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2013 } else { 3881b86128d0be kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2014 /* 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2015 * Switch back to member is always allowed even if it 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2016 * causes child partitions to become invalid. 3881b86128d0be kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2017 */ 3881b86128d0be kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2018 err = 0; 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2019 update_parent_subparts_cpumask(cs, partcmd_disable, NULL, 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2020 &tmpmask); 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2021 /* 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2022 * If there are child partitions, we have to make them invalid. 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2023 */ 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2024 if (unlikely(cs->nr_subparts_cpus)) { 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2025 struct tmpmasks tmp; 3881b86128d0be kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2026 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2027 spin_lock_irq(&callback_lock); 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2028 cs->nr_subparts_cpus = 0; 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2029 cpumask_clear(cs->subparts_cpus); 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2030 compute_effective_cpumask(cs->effective_cpus, cs, parent); 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2031 spin_unlock_irq(&callback_lock); 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2032 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2033 /* 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2034 * If alloc_cpumasks() fails, we are running out 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2035 * of memory and there isn't much we can do. 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2036 */ 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2037 if (!alloc_cpumasks(NULL, &tmp)) { 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2038 update_cpumasks_hier(cs, &tmp); 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2039 free_cpumasks(NULL, &tmp); 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2040 } 56ec7dd271c77e kernel/cgroup/cpuset.c Waiman Long 2021-08-14 2041 } ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2042 ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2043 /* Turning off CS_CPU_EXCLUSIVE will not return error */ ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2044 update_flag(CS_CPU_EXCLUSIVE, cs, 0); ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2045 } ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2046 ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2047 /* ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2048 * Update cpumask of parent's tasks except when it is the top ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2049 * cpuset as some system daemons cannot be mapped to other CPUs. ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2050 */ ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2051 if (parent != &top_cpuset) ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2052 update_tasks_cpumask(parent); ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2053 4716909cc5c566 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2054 if (parent->child_ecpus_count) 0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long 2021-07-20 2055 update_sibling_cpumasks(parent, cs, &tmpmask); 4716909cc5c566 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2056 ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2057 rebuild_sched_domains_locked(); ee8dde0cd2ce78 kernel/cgroup/cpuset.c Waiman Long 2018-11-08 2058 out: 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 2059 if (!err) { 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 2060 spin_lock_irq(&callback_lock); 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 2061 cs->partition_root_state = new_prs; 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 2062 spin_unlock_irq(&callback_lock); e7cc9888dc5792 kernel/cgroup/cpuset.c Waiman Long 2021-08-10 2063 notify_partition_change(cs, old_prs, new_prs); 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 2064 } 6ba34d3c73674e kernel/cgroup/cpuset.c Waiman Long 2021-07-20 2065 0f3adb8a1e5f36 kernel/cgroup/cpuset.c Waiman Long 2021-07-20 2066 free_cpumasks(NULL, &tmpmask); 645fcc9d2f6946 kernel/cpuset.c Li Zefan 2009-01-07 2067 return err; ^1da177e4c3f41 kernel/cpuset.c Linus Torvalds 2005-04-16 @2068 } ^1da177e4c3f41 kernel/cpuset.c Linus Torvalds 2005-04-16 2069
--- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On 8/14/21 4:21 PM, kernel test robot wrote:
Hi Waiman,
I love your patch! Perhaps something to improve:
[auto build test WARNING on cgroup/for-next] [also build test WARNING on next-20210813] [cannot apply to kselftest/next v5.14-rc5] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Waiman-Long/cgroup-cpuset-Add-new-c... base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next config: ia64-defconfig (attached as .config) compiler: ia64-linux-gcc (GCC) 11.2.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/56ec7dd271c77e3cc92f0df6fd766004a7a0... git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Waiman-Long/cgroup-cpuset-Add-new-cpuset-partition-type-empty-effecitve-cpus/20210815-014333 git checkout 56ec7dd271c77e3cc92f0df6fd766004a7a0aa88 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=ia64
If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot lkp@intel.com
All warnings (new ones prefixed by >>):
kernel/cgroup/cpuset.c: In function 'update_prstate':
Oh, it was caused by a duplicated tmpmask in update_prstate() which isn't really necessary. Will send out a new version to fix that.
Thanks, Longman
There are a number of different reasons which can cause a partition to become invalid. A user seeing an invalid partition may not know exactly why. To help user to get a better understanding of the underlying reason, The cpuset.cpus.partition control file, when read, will now report the reason why a partition become invalid. When a partition does become invalid, reading the control file will show "root invalid (<reason>)" where <reason> is a string that describes why the partition is invalid.
Signed-off-by: Waiman Long longman@redhat.com --- kernel/cgroup/cpuset.c | 46 +++++++++++++++++++++++++++++++++++++++--- 1 file changed, 43 insertions(+), 3 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 7705950ad70b..5fec3c96465c 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -78,6 +78,24 @@ struct fmeter { spinlock_t lock; /* guards read or write of above */ };
+/* + * Invalid partition error code + */ +enum prs_errcode { + PERR_NONE = 0, + PERR_INVCPUS, + PERR_NOCPUS, + PERR_PARENT, + PERR_HOTPLUG, +}; + +static const char * const perr_strings[] = { + [PERR_INVCPUS] = "Invalid change to cpuset.cpus", + [PERR_PARENT] = "Parent is no longer a partition root", + [PERR_NOCPUS] = "Parent unable to distribute cpu downstream", + [PERR_HOTPLUG] = "No cpu available due to hotplug", +}; + struct cpuset { struct cgroup_subsys_state css;
@@ -163,6 +181,9 @@ struct cpuset {
/* Handle for cpuset.cpus.partition */ struct cgroup_file partition_file; + + /* Invalid partiton error code, not lock protected */ + enum prs_errcode prs_err; };
/* @@ -272,8 +293,13 @@ static inline int is_partition_root(const struct cpuset *cs) static inline void notify_partition_change(struct cpuset *cs, int old_prs, int new_prs) { - if (old_prs != new_prs) - cgroup_file_notify(&cs->partition_file); + if (old_prs == new_prs) + return; + cgroup_file_notify(&cs->partition_file); + + /* Reset prs_err if not invalid */ + if (new_prs != PRS_ERROR) + WRITE_ONCE(cs->prs_err, PERR_NONE); }
static struct cpuset top_cpuset = { @@ -1243,6 +1269,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, cpu_active_mask)) part_error = true; cpumask_copy(tmp->addmask, parent->effective_cpus); + if ((READ_ONCE(cpuset->prs_err) == PERR_NONE) && part_error) + WRITE_ONCE(cpuset->prs_err, PERR_INVCPUS); } } else { /* @@ -1264,6 +1292,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, part_error = (is_partition_root(cpuset) && !parent->nr_subparts_cpus) || cpumask_equal(tmp->addmask, parent->effective_cpus); + if (is_partition_root(cpuset) && part_error) + WRITE_ONCE(cpuset->prs_err, PERR_NOCPUS); }
if (cmd == partcmd_update) { @@ -1427,6 +1457,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) * When parent is invalid, it has to be too. */ new_prs = PRS_ERROR; + WRITE_ONCE(cp->prs_err, PERR_PARENT); break; } } @@ -2551,6 +2582,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) static int sched_partition_show(struct seq_file *seq, void *v) { struct cpuset *cs = css_cs(seq_css(seq)); + const char *err;
switch (cs->partition_root_state) { case PRS_ENABLED: @@ -2560,7 +2592,11 @@ static int sched_partition_show(struct seq_file *seq, void *v) seq_puts(seq, "member\n"); break; case PRS_ERROR: - seq_puts(seq, "root invalid\n"); + err = perr_strings[READ_ONCE(cs->prs_err)]; + if (err) + seq_printf(seq, "root invalid (%s)\n", err); + else + seq_puts(seq, "root invalid\n"); break; } return 0; @@ -3160,6 +3196,10 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp) spin_lock_irq(&callback_lock); cs->partition_root_state = PRS_ERROR; spin_unlock_irq(&callback_lock); + if (parent->partition_root_state == PRS_ERROR) + WRITE_ONCE(cs->prs_err, PERR_PARENT); + else + WRITE_ONCE(cs->prs_err, PERR_HOTPLUG); notify_partition_change(cs, old_prs, PRS_ERROR); } cpuset_force_rebuild();
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD
commit 994fb794cb252edd124a46ca0994e37a4726a100 Author: Waiman Long longman@redhat.com Date: Sat, 19 Jun 2021 13:28:19 -0400
cgroup/cpuset: Add a new isolated cpus.partition type
Cpuset v1 uses the sched_load_balance control file to determine if load balancing should be enabled. Cpuset v2 gets rid of sched_load_balance as its use may require disabling load balancing at cgroup root.
For workloads that require very low latency like DPDK, the latency jitters caused by periodic load balancing may exceed the desired latency limit.
When cpuset v2 is in use, the only way to avoid this latency cost is to use the "isolcpus=" kernel boot option to isolate a set of CPUs. After the kernel boot, however, there is no way to add or remove CPUs from this isolated set. For workloads that are more dynamic in nature, that means users have to provision enough CPUs for the worst case situation resulting in excess idle CPUs.
To address this issue for cpuset v2, a new cpuset.cpus.partition type "isolated" is added which allows the creation of a cpuset partition without load balancing. This will allow system administrators to dynamically adjust the size of isolated partition to the current need of the workload without rebooting the system.
Signed-off-by: Waiman Long longman@redhat.com
Signed-off-by: Waiman Long longman@redhat.com --- kernel/cgroup/cpuset.c | 48 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 43 insertions(+), 5 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 5fec3c96465c..d7b3eb024b84 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -193,6 +193,8 @@ struct cpuset { * * 1 - partition root * + * 2 - partition root without load balancing (isolated) + * * -1 - invalid partition root * None of the cpus in cpus_allowed can be put into the parent's * subparts_cpus. In this case, the cpuset is not a real partition @@ -202,6 +204,7 @@ struct cpuset { */ #define PRS_DISABLED 0 #define PRS_ENABLED 1 +#define PRS_ISOLATED 2 #define PRS_ERROR -1
/* @@ -1298,17 +1301,22 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
if (cmd == partcmd_update) { /* - * Check for possible transition between PRS_ENABLED - * and PRS_ERROR. + * Check for possible transition between PRS_ERROR and + * PRS_ENABLED/PRS_ISOLATED. */ switch (cpuset->partition_root_state) { case PRS_ENABLED: + case PRS_ISOLATED: if (part_error) new_prs = PRS_ERROR; break; case PRS_ERROR: - if (!part_error) + if (part_error) + break; + if (is_sched_load_balance(cpuset)) new_prs = PRS_ENABLED; + else + new_prs = PRS_ISOLATED; break; } } @@ -1443,6 +1451,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) break;
case PRS_ENABLED: + case PRS_ISOLATED: if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp)) update_tasks_cpumask(parent); /* @@ -1468,7 +1477,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
spin_lock_irq(&callback_lock);
- if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) { + if (cp->nr_subparts_cpus && (new_prs <= 0)) { /* * Put all active subparts_cpus back to effective_cpus. */ @@ -2007,6 +2016,7 @@ static int update_prstate(struct cpuset *cs, int new_prs) int err, old_prs = cs->partition_root_state; struct cpuset *parent = parent_cs(cs); struct tmpmasks tmpmask; + bool sched_domain_rebuilt = false;
if (old_prs == new_prs) return 0; @@ -2041,6 +2051,22 @@ static int update_prstate(struct cpuset *cs, int new_prs) update_flag(CS_CPU_EXCLUSIVE, cs, 0); goto out; } + + if (new_prs == PRS_ISOLATED) { + /* + * Disable the load balance flag should not return an + * error unless the system is running out of memory. + */ + update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); + sched_domain_rebuilt = true; + } + } else if (old_prs && new_prs) { + /* + * A change in load balance state only, no change in cpumasks. + */ + update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED)); + err = 0; + goto out; /* Sched domain is rebuilt in update_flag() */ } else { /* * Switch back to member is always allowed even if it @@ -2073,6 +2099,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)
/* Turning off CS_CPU_EXCLUSIVE will not return error */ update_flag(CS_CPU_EXCLUSIVE, cs, 0); + + if (!is_sched_load_balance(cs)) { + /* Make sure load balance is on */ + update_flag(CS_SCHED_LOAD_BALANCE, cs, 1); + sched_domain_rebuilt = true; + } }
/* @@ -2085,7 +2117,8 @@ static int update_prstate(struct cpuset *cs, int new_prs) if (parent->child_ecpus_count) update_sibling_cpumasks(parent, cs, &tmpmask);
- rebuild_sched_domains_locked(); + if (!sched_domain_rebuilt) + rebuild_sched_domains_locked(); out: if (!err) { spin_lock_irq(&callback_lock); @@ -2588,6 +2621,9 @@ static int sched_partition_show(struct seq_file *seq, void *v) case PRS_ENABLED: seq_puts(seq, "root\n"); break; + case PRS_ISOLATED: + seq_puts(seq, "isolated\n"); + break; case PRS_DISABLED: seq_puts(seq, "member\n"); break; @@ -2618,6 +2654,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf, val = PRS_ENABLED; else if (!strcmp(buf, "member")) val = PRS_DISABLED; + else if (!strcmp(buf, "isolated")) + val = PRS_ISOLATED; else return -EINVAL;
Currently, a parent partition cannot distribute all its CPUs to child partitions with no CPUs left. However in some use cases, a management application may want to create a parent partition as a management unit with no task associated with it and has all its CPUs distributed to various child partitions dynamically according to their needs. Leaving a cpu in the parent partition in such a case is now a waste.
To accommodate such use cases, a parent partition can now have all its CPUs distributed to its child partitions with 0 effective cpu left as long as it is not the top cpuset and it has no task at the time the child partition is being created. A terminal partition with no child partition underlying it, however, cannot have 0 effective cpu which will make the partition invalid.
Once an empty parent partition is formed, no new task can be moved into it.
Signed-off-by: Waiman Long longman@redhat.com --- kernel/cgroup/cpuset.c | 96 ++++++++++++++++++++++++++++++------------ 1 file changed, 69 insertions(+), 27 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index d7b3eb024b84..73eff47a38de 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -305,6 +305,21 @@ static inline void notify_partition_change(struct cpuset *cs, WRITE_ONCE(cs->prs_err, PERR_NONE); }
+static inline int cpuset_has_tasks(const struct cpuset *cs) +{ + return cs->css.cgroup->nr_populated_csets; +} + +/* + * A empty partition (one with no effective cpu) is valid if it has no + * associated task and all its cpus have been distributed out to child + * partitions. + */ +static inline bool valid_empty_partition(const struct cpuset *cs) +{ + return !cpuset_has_tasks(cs) && cs->nr_subparts_cpus; +} + static struct cpuset top_cpuset = { .flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)), @@ -1211,22 +1226,32 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, if ((cmd == partcmd_enable) && css_has_online_children(&cpuset->css)) return -EBUSY;
- /* - * Enabling partition root is not allowed if not all the CPUs - * can be granted from parent's effective_cpus or at least one - * CPU will be left after that. - */ - if ((cmd == partcmd_enable) && - (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) || - cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))) - return -EINVAL; - /* * A cpumask update cannot make parent's effective_cpus become empty. */ adding = deleting = false; old_prs = new_prs = cpuset->partition_root_state; if (cmd == partcmd_enable) { + bool parent_is_top_cpuset = !parent_cs(parent); + bool no_cpu_in_parent = cpumask_equal(cpuset->cpus_allowed, + parent->effective_cpus); + /* + * Enabling partition root is not allowed if not all the CPUs + * can be granted from parent's effective_cpus. If the parent + * is the top cpuset, at least one CPU must be left after that. + */ + if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) || + (parent_is_top_cpuset && no_cpu_in_parent)) + return -EINVAL; + + /* + * A non-top parent can be left with no CPU as long as there + * is no task directly associated with the parent. For such + * a parent, no new task can be moved into it. + */ + if (no_cpu_in_parent && cpuset_has_tasks(parent)) + return -EINVAL; + cpumask_copy(tmp->addmask, cpuset->cpus_allowed); adding = true; } else if (cmd == partcmd_disable) { @@ -1257,9 +1282,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, parent->subparts_cpus); /* * Make partition invalid if parent's effective_cpus could - * become empty. + * become empty and there are tasks in the parent. */ - if (adding && + if (adding && cpuset_has_tasks(parent) && cpumask_equal(parent->effective_cpus, tmp->addmask)) { if (!deleting) part_error = true; @@ -1294,7 +1319,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd, parent->effective_cpus); part_error = (is_partition_root(cpuset) && !parent->nr_subparts_cpus) || - cpumask_equal(tmp->addmask, parent->effective_cpus); + (cpumask_equal(tmp->addmask, parent->effective_cpus) && + cpuset_has_tasks(parent)); + if (is_partition_root(cpuset) && part_error) WRITE_ONCE(cpuset->prs_err, PERR_NOCPUS); } @@ -1397,9 +1424,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
/* * If it becomes empty, inherit the effective mask of the - * parent, which is guaranteed to have some CPUs. + * parent, which is guaranteed to have some CPUs unless + * it is a partition root that has explicitly distributed + * out all its CPUs. */ if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) { + if (is_partition_root(cp) && + cpumask_equal(cp->cpus_allowed, cp->subparts_cpus)) + goto update_parent_subparts; + cpumask_copy(tmp->new_cpus, parent->effective_cpus); if (!cp->use_parent_ecpus) { cp->use_parent_ecpus = true; @@ -1421,6 +1454,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) continue; }
+update_parent_subparts: /* * update_parent_subparts_cpumask() should have been called * for cs already in update_cpumask(). We should also call @@ -1497,12 +1531,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp) */ cpumask_andnot(cp->effective_cpus, cp->effective_cpus, cp->subparts_cpus); - WARN_ON_ONCE(cpumask_empty(cp->effective_cpus)); }
- if (new_prs != old_prs) - cp->partition_root_state = new_prs; - + cp->partition_root_state = new_prs; spin_unlock_irq(&callback_lock); notify_partition_change(cp, old_prs, new_prs);
@@ -2254,6 +2285,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset) (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))) goto out_unlock;
+ /* + * On default hierarchy, task cannot be moved to a cpuset with empty + * effective cpus. + */ + if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus)) + goto out_unlock; + cgroup_taskset_for_each(task, css, tset) { ret = task_can_attach(task, cs->cpus_allowed); if (ret) @@ -3130,7 +3168,8 @@ hotplug_update_tasks(struct cpuset *cs, struct cpumask *new_cpus, nodemask_t *new_mems, bool cpus_updated, bool mems_updated) { - if (cpumask_empty(new_cpus)) + /* A partition root is allowed to have empty effective cpus */ + if (cpumask_empty(new_cpus) && !is_partition_root(cs)) cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus); if (nodes_empty(*new_mems)) *new_mems = parent_cs(cs)->effective_mems; @@ -3199,22 +3238,25 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
/* * In the unlikely event that a partition root has empty - * effective_cpus, we will have to force any child partitions, - * if present, to become invalid by setting nr_subparts_cpus to 0 - * without causing itself to become invalid. + * effective_cpus with tasks, we will have to force any child + * partitions, if present, to become invalid by setting + * nr_subparts_cpus to 0 without causing itself to become invalid. */ - if (is_partition_root(cs) && cs->nr_subparts_cpus && - cpumask_empty(&new_cpus)) { + if (is_partition_root(cs) && cpumask_empty(&new_cpus) && + !valid_empty_partition(cs)) { cs->nr_subparts_cpus = 0; cpumask_clear(cs->subparts_cpus); compute_effective_cpumask(&new_cpus, cs, parent); }
/* - * If empty effective_cpus or zero nr_subparts_cpus or its parent - * becomes erroneous, we have to transition it to the erroneous state. + * Force the partition to become invalid if either one of + * the following conditions hold: + * 1) empty effective cpus but not valid empty partition. + * 2) parent is invalid or doesn't grant any cpus to child partitions. */ - if (is_partition_root(cs) && (cpumask_empty(&new_cpus) || + if (is_partition_root(cs) && + ((cpumask_empty(&new_cpus) && !valid_empty_partition(cs)) || (parent->partition_root_state == PRS_ERROR) || !parent->nr_subparts_cpus)) { int old_prs;
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced "isolated" cpuset partition type as well as the ability to create non-top cpuset partition with no cpu allocated to it.
Signed-off-by: Waiman Long longman@redhat.com --- Documentation/admin-guide/cgroup-v2.rst | 116 +++++++++++++++--------- 1 file changed, 71 insertions(+), 45 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index babbe04c8d37..9ad52f74fb12 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -2091,8 +2091,9 @@ Cpuset Interface Files It accepts only the following input values when written to.
======== ================================ - "root" a partition root - "member" a non-root member of a partition + "member" Non-root member of a partition + "root" Partition root + "isolated" Partition root without load balancing ======== ================================
When set to be a partition root, the current cgroup is the @@ -2101,64 +2102,89 @@ Cpuset Interface Files partition roots themselves and their descendants. The root cgroup is always a partition root.
- There are constraints on where a partition root can be set. - It can only be set in a cgroup if all the following conditions - are true. + When set to "isolated", the CPUs in that partition root will + be in an isolated state without any load balancing from the + scheduler. Tasks in such a partition must be explicitly bound + to each individual CPU. + + There are constraints on where a partition root can be set + ("root" or "isolated"). It can only be set in a cgroup if all + the following conditions are true.
1) The "cpuset.cpus" is not empty and the list of CPUs are exclusive, i.e. they are not shared by any of its siblings. 2) The parent cgroup is a partition root. - 3) The "cpuset.cpus" is also a proper subset of the parent's + 3) The "cpuset.cpus" is a subset of the parent's "cpuset.cpus.effective". 4) There is no child cgroups with cpuset enabled. This is for eliminating corner cases that have to be handled if such a condition is allowed.
- Setting it to partition root will take the CPUs away from the - effective CPUs of the parent cgroup. Once it is set, this + Setting it to a partition root will take the CPUs away from + the effective CPUs of the parent cgroup. Once it is set, this file cannot be reverted back to "member" if there are any child cgroups with cpuset enabled.
- A parent partition cannot distribute all its CPUs to its - child partitions. There must be at least one cpu left in the - parent partition. - - Once becoming a partition root, changes to "cpuset.cpus" is - generally allowed as long as the first condition above is true, - the change will not take away all the CPUs from the parent - partition and the new "cpuset.cpus" value is a superset of its - children's "cpuset.cpus" values. - - Sometimes, external factors like changes to ancestors' - "cpuset.cpus" or cpu hotplug can cause the state of the partition - root to change. On read, the "cpuset.sched.partition" file - can show the following values. - - ============== ============================== - "member" Non-root member of a partition - "root" Partition root - "root invalid" Invalid partition root - ============== ============================== - - It is a partition root if the first 2 partition root conditions - above are true and at least one CPU from "cpuset.cpus" is - granted by the parent cgroup. - - A partition root can become invalid if none of CPUs requested - in "cpuset.cpus" can be granted by the parent cgroup or the - parent cgroup is no longer a partition root itself. In this - case, it is not a real partition even though the restriction - of the first partition root condition above will still apply. + A parent partition may distribute all its CPUs to its child + partitions as long as it is not the root cgroup and there is no + task directly associated with that parent partition. Otherwise, + there must be at least one cpu left in the parent partition. + A new task cannot be moved to a partition root with no effective + cpu. + + Once becoming a partition root, changes to "cpuset.cpus" + is generally allowed as long as the first condition above + (cpu exclusivity rule) is true. + + Sometimes, changes to "cpuset.cpus" or cpu hotplug may cause + the state of the partition root to become invalid when the + other constraints of partition root are violated. Therefore, + it is recommended that users should always set "cpuset.cpus" + to the proper value first before enabling partition. In case + "cpuset.cpus" has to be modified after partition is enabled, + users should check the state of "cpuset.cpus.partition" after + making change to it to make sure that the partition is still + valid. + + On read, the "cpuset.cpus.partition" file can show the following + values. + + ====================== ============================== + "member" Non-root member of a partition + "root" Partition root + "isolated" Partition root without load balancing + "root invalid (<reason>)" Invalid partition root + ====================== ============================== + + A partition root becomes invalid if all the CPUs requested in + "cpuset.cpus" become unavailable. This can happen if all the + CPUs have been offlined, or the state of an ancestor partition + root become invalid. "<reason>" is a string that describes why + the partition becomes invalid. + + An invalid partition is not a real partition even though the + restriction of the cpu exclusivity rule will still apply. The cpu affinity of all the tasks in the cgroup will then be associated with CPUs in the nearest ancestor partition.
- An invalid partition root can be transitioned back to a - real partition root if at least one of the requested CPUs - can now be granted by its parent. In this case, the cpu - affinity of all the tasks in the formerly invalid partition - will be associated to the CPUs of the newly formed partition. - Changing the partition state of an invalid partition root to - "member" is always allowed even if child cpusets are present. + In the special case of a parent partition competing with a child + partition for the only CPU left, the parent partition wins and + the child partition becomes invalid. + + An invalid partition root can be transitioned back to a real + partition root if at least one of the requested CPUs become + available again. In this case, the cpu affinity of all the tasks + in the formerly invalid partition will be associated to the CPUs + of the newly formed partition. Changing the partition state of + an invalid partition root to "member" is always allowed even if + child cpusets are present. However changing a partition root back + to member will not be allowed if child partitions are present. + + Poll and inotify events are triggered whenever the state + of "cpuset.cpus.partition" changes. That includes changes + caused by write to "cpuset.cpus.partition" and cpu hotplug. + This will allow an user space agent to monitor changes caused + by hotplug events.
Device controller
Add a test script test_cpuset_prs.sh with a helper program wait_inotify for exercising the cpuset v2 partition root state code.
Signed-off-by: Waiman Long longman@redhat.com --- tools/testing/selftests/cgroup/Makefile | 5 +- .../selftests/cgroup/test_cpuset_prs.sh | 663 ++++++++++++++++++ tools/testing/selftests/cgroup/wait_inotify.c | 86 +++ 3 files changed, 752 insertions(+), 2 deletions(-) create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile index 59e222460581..3f1fd3f93f41 100644 --- a/tools/testing/selftests/cgroup/Makefile +++ b/tools/testing/selftests/cgroup/Makefile @@ -1,10 +1,11 @@ # SPDX-License-Identifier: GPL-2.0 CFLAGS += -Wall -pthread
-all: +all: ${HELPER_PROGS}
TEST_FILES := with_stress.sh -TEST_PROGS := test_stress.sh +TEST_PROGS := test_stress.sh test_cpuset_prs.sh +TEST_GEN_FILES := wait_inotify TEST_GEN_PROGS = test_memcontrol TEST_GEN_PROGS += test_kmem TEST_GEN_PROGS += test_core diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh new file mode 100755 index 000000000000..5a8fba05cbfb --- /dev/null +++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh @@ -0,0 +1,663 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Test for cpuset v2 partition root state (PRS) +# +# The sched verbose flag is set, if available, so that the console log +# can be examined for the correct setting of scheduling domain. +# + +skip_test() { + echo "$1" + echo "Test SKIPPED" + exit 0 +} + +[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!" + +# Set sched verbose flag, if available +[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose + +# Get wait_inotify location +WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify + +# Find cgroup v2 mount point +CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}') +[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!" + +CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//") +[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!" + +# Set verbose flag and delay factor +PROG=$1 +VERBOSE= +DELAY_FACTOR=1 +while [[ "$1" = -* ]] +do + case "$1" in + -v) VERBOSE=1 + break + ;; + -d) DELAY_FACTOR=$2 + shift + break + ;; + *) echo "Usage: $PROG [-v] [-d <delay-factor>" + exit + ;; + esac + shift +done + +cd $CGROUP2 +echo +cpuset > cgroup.subtree_control +[[ -d test ]] || mkdir test +cd test + +# Pause in ms +pause() +{ + DELAY=$1 + LOOP=0 + while [[ $LOOP -lt $DELAY_FACTOR ]] + do + sleep $DELAY + ((LOOP++)) + done + return 0 +} + +console_msg() +{ + MSG=$1 + echo "$MSG" + echo "" > /dev/console + echo "$MSG" > /dev/console + pause 0.01 +} + +test_partition() +{ + EXPECTED_VAL=$1 + echo $EXPECTED_VAL > cpuset.cpus.partition + [[ $? -eq 0 ]] || exit 1 + ACTUAL_VAL=$(cat cpuset.cpus.partition) + [[ $ACTUAL_VAL != $EXPECTED_VAL ]] && { + echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $EXPECTED_VAL" + echo "Test FAILED" + exit 1 + } +} + +test_effective_cpus() +{ + EXPECTED_VAL=$1 + ACTUAL_VAL=$(cat cpuset.cpus.effective) + [[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && { + echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$EXPECTED_VAL'" + echo "Test FAILED" + exit 1 + } +} + +# Adding current process to cgroup.procs as a test +test_add_proc() +{ + OUTSTR="$1" + ERRMSG=$((echo $$ > cgroup.procs) |& cat) + echo $ERRMSG | grep -q "$OUTSTR" + [[ $? -ne 0 ]] && { + echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'" + echo "Test FAILED" + exit 1 + } + echo $$ > $CGROUP2/cgroup.procs # Move out the task +} + +# +# Testing the new "isolated" partition root type +# +test_isolated() +{ + echo 2-3 > cpuset.cpus + TYPE=$(cat cpuset.cpus.partition) + [[ $TYPE = member ]] || echo member > cpuset.cpus.partition + + console_msg "Change from member to root" + test_partition root + + console_msg "Change from root to isolated" + test_partition isolated + + console_msg "Change from isolated to member" + test_partition member + + console_msg "Change from member to isolated" + test_partition isolated + + console_msg "Change from isolated to root" + test_partition root + + console_msg "Change from root to member" + test_partition member + + # + # Testing partition root with no cpu + # + console_msg "Distribute all cpus to child partition" + echo +cpuset > cgroup.subtree_control + test_partition root + + mkdir A1 + cd A1 + echo 2-3 > cpuset.cpus + test_partition root + test_effective_cpus 2-3 + cd .. + test_effective_cpus "" + + console_msg "Moving task to partition test" + test_add_proc "No space left" + cd A1 + test_add_proc "" + cd .. + + console_msg "Shrink and expand child partition" + cd A1 + echo 2 > cpuset.cpus + cd .. + test_effective_cpus 3 + cd A1 + echo 2-3 > cpuset.cpus + cd .. + test_effective_cpus "" + + # Cleaning up + console_msg "Cleaning up" + echo $$ > $CGROUP2/cgroup.procs + [[ -d A1 ]] && rmdir A1 +} + +# +# Cpuset controller state transition test matrix. +# +# Cgroup test hierarchy +# +# test -- A1 -- A2 -- A3 +# - B1 +# +# P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid) +# C<l> = add cpu-list +# S<p> = use prefix in subtree_control +# T = put a task into cgroup +# O<c>-<v> = Write <v> to CPU online file of <c> +# +SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1" +TEST_MATRIX=( + # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate + # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ + " S+ C0-1 . . C2-3 S+ C4-5 . . 0 A2:0-1" + " S+ C0-1 . . C2-3 P1 . . . 0 " + " S+ C0-1 . . C2-3 P1:S+ C0-1:P1 . . 0 " + " S+ C0-1 . . C2-3 P1:S+ C1:P1 . . 0 " + " S+ C0-1:S+ . . C2-3 . . . P1 0 " + " S+ C0-1:P1 . . C2-3 S+ C1 . . 0 " + " S+ C0-1:P1 . . C2-3 S+ C1:P1 . . 0 " + " S+ C0-1:P1 . . C2-3 S+ C1:P1 . P1 0 " + " S+ C0-1:P1 . . C2-3 C4-5 . . . 0 A1:4-5" + " S+ C0-1:P1 . . C2-3 S+:C4-5 . . . 0 A1:4-5" + " S+ C0-1 . . C2-3:P1 . . . C2 0 " + " S+ C0-1 . . C2-3:P1 . . . C4-5 0 B1:4-5" + " S+ C0-3:P1:S+ C2-3:P1 . . . . . . 0 A1:0-1,A2:2-3" + " S+ C0-3:P1:S+ C2-3:P1 . . C1-3 . . . 0 A1:1,A2:2-3" + " S+ C2-3:P1:S+ C3:P1 . . C3 . . . 0 A1:,A2:3 A1:P1,A2:P1" + " S+ C2-3:P1:S+ C3:P1 . . C3 P0 . . 0 A1:3,A2:3 A1:P1,A2:P0" + " S+ C2-3:P1:S+ C2:P1 . . C2-4 . . . 0 A1:3-4,A2:2" + " S+ C2-3:P1:S+ C3:P1 . . C3 . . C0-2 0 A1:,B1:0-2 A1:P1,A2:P1" + " S+ $SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + + # CPU offlining cases: + " S+ C0-1 . . C2-3 S+ C4-5 . O2-0 0 A1:0-1,B1:3" + " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 . . . 0 A1:0-1,A2:3" + " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 O2-1 . . 0 A1:0-1,A2:2-3" + " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 . . . 0 A1:0,A2:2-3" + " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 O1-1 . . 0 A1:0-1,A2:2-3" + " S+ C2-3:P1:S+ C3:P1 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P1" + " S+ C2-3:P1:S+ C3:P2 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P2" + " S+ C2-3:P1:S+ C3:P1 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P1" + " S+ C2-3:P1:S+ C3:P2 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P2" + " S+ C2-3:P1:S+ C3:P1 . . O2-0 . . . 0 A1:,A2:3 A1:P1,A2:P1" + " S+ C2-3:P1:S+ C3:P1 . . O3-0 . . . 0 A1:2,A2:2 A1:P1,A2:P-1" + " S+ C2-3:P1:S+ C3:P1 . . T:O2-0 . . . 0 A1:3,A2:3 A1:P1,A2:P-1" + " S+ $SETUP_A123_PARTITIONS . O1-0 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . O2-0 . . . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . O3-0 . . . 0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1" + " S+ $SETUP_A123_PARTITIONS . T:O1-0 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1" + " S+ $SETUP_A123_PARTITIONS . . T:O2-0 . . 0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1" + " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 . 0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1" + " S+ $SETUP_A123_PARTITIONS . T:O1-0 O1-1 . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . . T:O2-0 O2-1 . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 O3-1 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O1-1 . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1" + " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O2-1 . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1" + + # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate + # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ + # + # Incorrect change to cpuset.cpus invalidates partition root + # + # Adding CPUs to partition root that are not in parent's + # cpuset.cpus.effective makes it invalid. + " S+ C2-3:P1:S+ C3:P1 . . . C2-4 . . 0 A1:2-3,A2:2-3 A1:P1,A2:P-1" + + # Taking away all CPUs from parent or itself if there are tasks + # will make the partition invalid. + " S+ C2-3:P1:S+ C3:P1 . . T C2-3 . . 0 A1:2-3,A2:2-3 A1:P1,A2:P-1" + " S+ $SETUP_A123_PARTITIONS . T:C2-3 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1" + " S+ $SETUP_A123_PARTITIONS . T:C2-3:C1-3 . . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1" + + # Changing a partition root member invalidates child partitions + " S+ C2-3:P1:S+ C3:P1 . . P0 . . . 0 A1:2-3,A2:3 A1:P0,A2:P-1" + " S+ $SETUP_A123_PARTITIONS . C2-3 P0 . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P0,A3:P-1" + + # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate + # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ + # Failure cases: + + # To become a partition root, cpuset.cpus must be a subset of + # parent's cpuset.cpus.effective. + " S+ C0-1 . . C2-3 S+ C4-5:P1 . . 1 " + + # A cpuset cannot become a partition root if it has child cpusets + # with non-empty cpuset.cpus. + " S+ C0-1:S+ C1 . C2-3 P1 . . . 1 " + + # Any change to cpuset.cpus of a partition root must be exclusive. + " S+ C0-1:P1 . . C2-3 C0-2 . . . 1 " + " S+ C0-1 . . C2-3:P1 . . . C1 1 " + " S+ C2-3:P1:S+ C2:P1 . C1 C1-3 . . . 1 " + + # Deletion of CPUs distributed to child partition root is not allowed. + " S+ C0-1:P1:S+ C1 . C2-3 C4-5 . . . 1 " + " S+ C0-3:P1:S+ C2-3:P1 . . C0-2 . . . 1 " + + # A task cannot be added to a non-terminal partition with no cpu + " S+ C2-3:P1:S+ C3:P1 . . O2-0:T . . . 1 A1:,A2:3 A1:P1,A2:P1" +) + +# +# Write to the cpu online file +# $1 - <c>-<v> where <c> = cpu number, <v> value to be written +# +write_cpu_online() +{ + CPU=${1%-*} + VAL=${1#*-} + CPUFILE=//sys/devices/system/cpu/cpu${CPU}/online + if [[ $VAL -eq 0 ]] + then + OFFLINE_CPUS="$OFFLINE_CPUS $CPU" + else + [[ -n "$OFFLINE_CPUS" ]] && { + OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\ + sort | uniq -u) + } + fi + echo $VAL > $CPUFILE + pause 0.01 +} + +# +# Set controller state +# $1 - cgroup directory +# $2 - state +# $3 - showerr +# +# The presence of ":" in state means transition from one to the next. +# +set_ctrl_state() +{ + TMPMSG=/tmp/.msg_$$ + CGRP=$1 + STATE=$2 + SHOWERR=${3}${VERBOSE} + CTRL=${CTRL:=$CONTROLLER} + HASERR=0 + REDIRECT="2> $TMPMSG" + [[ -z "$STATE" || "$STATE" = '.' ]] && return 0 + + rm -f $TMPMSG + for CMD in $(echo $STATE | sed -e "s/:/ /g") + do + TFILE=$CGRP/cgroup.procs + SFILE=$CGRP/cgroup.subtree_control + PFILE=$CGRP/cpuset.cpus.partition + CFILE=$CGRP/cpuset.cpus + S=$(expr substr $CMD 1 1) + if [[ $S = S ]] + then + PREFIX=${CMD#?} + COMM="echo ${PREFIX}${CTRL} > $SFILE" + eval $COMM $REDIRECT + elif [[ $S = C ]] + then + CPUS=${CMD#?} + COMM="echo $CPUS > $CFILE" + eval $COMM $REDIRECT + elif [[ $S = P ]] + then + VAL=${CMD#?} + case $VAL in + 0) VAL=member + ;; + 1) VAL=root + ;; + 2) VAL=isolated + ;; + *) + echo "Invalid partiton state - $VAL" + exit 1 + ;; + esac + COMM="echo $VAL > $PFILE" + eval $COMM $REDIRECT + elif [[ $S = O ]] + then + VAL=${CMD#?} + write_cpu_online $VAL + elif [[ $S = T ]] + then + COMM="echo 0 > $TFILE" + eval $COMM $REDIRECT + fi + RET=$? + [[ $RET -ne 0 ]] && { + [[ -n "$SHOWERR" ]] && { + echo "$COMM" + cat $TMPMSG + } + HASERR=1 + } + pause 0.01 + rm -f $TMPMSG + done + return $HASERR +} + +set_ctrl_state_noerr() +{ + CGRP=$1 + STATE=$2 + [[ -d $CGRP ]] || mkdir $CGRP + set_ctrl_state $CGRP $STATE 1 + [[ $? -ne 0 ]] && { + echo "ERROR: Failed to set $2 to cgroup $1!" + exit 1 + } +} + +online_cpus() +{ + [[ -n "OFFLINE_CPUS" ]] && { + for C in $OFFLINE_CPUS + do + write_cpu_online ${C}-1 + done + } +} + +# +# Return 1 if the list of effective cpus isn't the same as the initial list. +# +reset_cgroup_states() +{ + echo 0 > $CGROUP2/cgroup.procs + online_cpus + rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1 + set_ctrl_state . S- + pause 0.01 +} + +dump_states() +{ + for DIR in A1 A1/A2 A1/A2/A3 B1 + do + ECPUS=$DIR/cpuset.cpus.effective + PRS=$DIR/cpuset.cpus.partition + [[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)" + [[ -e $PRS ]] && echo "$PRS: $(cat $PRS)" + done +} + +# +# Check effective cpus +# $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]* +# +check_effective_cpus() +{ + CHK_STR=$1 + for CHK in $(echo $CHK_STR | sed -e "s/,/ /g") + do + set -- $(echo $CHK | sed -e "s/:/ /g") + CGRP=$1 + CPUS=$2 + [[ $CGRP = A2 ]] && CGRP=A1/A2 + [[ $CGRP = A3 ]] && CGRP=A1/A2/A3 + FILE=$CGRP/cpuset.cpus.effective + [[ -e $FILE ]] || return 1 + [[ $CPUS = $(cat $FILE) ]] || return 1 + done +} + +# +# Check cgroup states +# $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]* +# +check_cgroup_states() +{ + CHK_STR=$1 + for CHK in $(echo $CHK_STR | sed -e "s/,/ /g") + do + set -- $(echo $CHK | sed -e "s/:/ /g") + CGRP=$1 + STATE=$2 + FILE= + EVAL=$(expr substr $STATE 2 2) + [[ $CGRP = A2 ]] && CGRP=A1/A2 + [[ $CGRP = A3 ]] && CGRP=A1/A2/A3 + + case $STATE in + P*) FILE=$CGRP/cpuset.cpus.partition + ;; + *) echo "Unknown state: $STATE!" + exit 1 + ;; + esac + VAL=$(cat $FILE) + + case "$VAL" in + member) VAL=0 + ;; + root) VAL=1 + ;; + isolated) + VAL=2 + ;; + "root invalid"*) + VAL=-1 + ;; + esac + [[ $EVAL != $VAL ]] && return 1 + done + return 0 +} + +# +# Run cpuset state transition test +# $1 - test matrix name +# +# This test is somewhat fragile as delays (sleep x) are added in various +# places to make sure state changes are fully propagated before the next +# action. These delays may need to be adjusted if running in a slower machine. +# +run_state_test() +{ + TEST=$1 + CONTROLLER=cpuset + CPULIST=0-6 + I=0 + eval CNT="${#$TEST[@]}" + + reset_cgroup_states + echo $CPULIST > cpuset.cpus + echo root > cpuset.cpus.partition + console_msg "Running state transition test ..." + + while [[ $I -lt $CNT ]] + do + echo "Running test $I ..." > /dev/console + eval set -- "${$TEST[$I]}" + ROOT=$1 + OLD_A1=$2 + OLD_A2=$3 + OLD_A3=$4 + OLD_B1=$5 + NEW_A1=$6 + NEW_A2=$7 + NEW_A3=$8 + NEW_B1=$9 + RESULT=${10} + ECPUS=${11} + STATES=${12} + + set_ctrl_state_noerr . $ROOT + set_ctrl_state_noerr A1 $OLD_A1 + set_ctrl_state_noerr A1/A2 $OLD_A2 + set_ctrl_state_noerr A1/A2/A3 $OLD_A3 + set_ctrl_state_noerr B1 $OLD_B1 + RETVAL=0 + set_ctrl_state A1 $NEW_A1; ((RETVAL += $?)) + set_ctrl_state A1/A2 $NEW_A2; ((RETVAL += $?)) + set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?)) + set_ctrl_state B1 $NEW_B1; ((RETVAL += $?)) + + [[ $RETVAL -ne $RESULT ]] && { + echo "Test $TEST[$I] failed result check!" + eval echo "${$TEST[$I]}" + dump_states + online_cpus + exit 1 + } + + [[ -n "$ECPUS" && "$ECPUS" != . ]] && { + check_effective_cpus $ECPUS + [[ $? -ne 0 ]] && { + echo "Test $TEST[$I] failed effective CPU check!" + eval echo "${$TEST[$I]}" + echo + dump_states + online_cpus + exit 1 + } + } + + [[ -n "$STATES" ]] && { + check_cgroup_states $STATES + [[ $? -ne 0 ]] && { + echo "FAILED: Test $TEST[$I] failed states check!" + eval echo "${$TEST[$I]}" + echo + dump_states + online_cpus + exit 1 + } + } + + reset_cgroup_states + # + # Check to see if effective cpu list changes + # + pause 0.05 + NEWLIST=$(cat cpuset.cpus.effective) + [[ $NEWLIST != $CPULIST ]] && { + echo "Effective cpus changed to $NEWLIST after test $I!" + exit 1 + } + [[ -n "$VERBOSE" ]] && echo "Test $I done." + ((I++)) + done + echo "All $I tests of $TEST PASSED." + + echo member > cpuset.cpus.partition +} + +# +# Wait for inotify event for the given file and read it +# $1: cgroup file to wait for +# $2: file to store the read result +# +wait_inotify() +{ + CGROUP_FILE=$1 + OUTPUT_FILE=$2 + + $WAIT_INOTIFY $CGROUP_FILE + cat $CGROUP_FILE > $OUTPUT_FILE +} + +# +# Test if inotify events are properly generated when going into and out of +# invalid partition state. +# +test_inotify() +{ + ERR=0 + PRS=/tmp/.prs_$$ + [[ -f $WAIT_INOTIFY ]] || { + echo "wait_inotify not found, inotify test SKIPPED." + return + } + + pause 0.01 + echo 1 > cpuset.cpus + echo 0 > cgroup.procs + echo root > cpuset.cpus.partition + pause 0.01 + rm -f $PRS + wait_inotify $PWD/cpuset.cpus.partition $PRS & + pause 0.01 + set_ctrl_state . "O1-0" + pause 0.01 + check_cgroup_states ".:P-1" + if [[ $? -ne 0 ]] + then + echo "FAILED: Inotify test - partition not invalid" + ERR=1 + elif [[ ! -f $PRS ]] + then + echo "FAILED: Inotify test - event not generated" + ERR=1 + kill %1 + elif [[ $(cat $PRS) != "root invalid"* ]] + then + echo "FAILED: Inotify test - incorrect state" + cat $PRS + ERR=1 + fi + online_cpus + echo member > cpuset.cpus.partition + echo 0 > ../cgroup.procs + if [[ $ERR -ne 0 ]] + then + exit 1 + else + echo "Inotify test PASSED" + fi +} + +run_state_test TEST_MATRIX +test_isolated +test_inotify +echo "All tests PASSED." +cd .. +rmdir test diff --git a/tools/testing/selftests/cgroup/wait_inotify.c b/tools/testing/selftests/cgroup/wait_inotify.c new file mode 100644 index 000000000000..d0758881f6bb --- /dev/null +++ b/tools/testing/selftests/cgroup/wait_inotify.c @@ -0,0 +1,86 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Wait until an inotify event on the given cgroup file. + */ +#include <linux/limits.h> +#include <sys/inotify.h> +#include <sys/mman.h> +#include <sys/ptrace.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <errno.h> +#include <fcntl.h> +#include <poll.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> + +const char usage[] = "Usage: %s [-v] <cgroup_file>\n"; +static char *file; +static int verbose; + +static inline void fail_message(char *msg) +{ + fprintf(stderr, msg, file); + exit(1); +} + +int main(int argc, char *argv[]) +{ + char *cmd = argv[0]; + int c, fd; + struct pollfd fds = { .events = POLLIN, }; + + while ((c = getopt(argc, argv, "v")) != -1) { + switch (c) { + case 'v': + verbose++; + break; + } + argv++, argc--; + } + + if (argc != 2) { + fprintf(stderr, usage, cmd); + return -1; + } + file = argv[1]; + fd = open(file, O_RDONLY); + if (fd < 0) + fail_message("Cgroup file %s not found!\n"); + close(fd); + + fd = inotify_init(); + if (fd < 0) + fail_message("inotify_init() fails on %s!\n"); + if (inotify_add_watch(fd, file, IN_MODIFY) < 0) + fail_message("inotify_add_watch() fails on %s!\n"); + fds.fd = fd; + + /* + * poll waiting loop + */ + for (;;) { + int ret = poll(&fds, 1, 10000); + if (ret < 0) { + if (errno == EINTR) + continue; + perror("poll"); + exit(1); + } + if ((ret > 0) && (fds.revents & POLLIN)) + break; + } + if (verbose) { + struct inotify_event events[10]; + long len; + + usleep(1000); + len = read(fd, events, sizeof(events)); + printf("Number of events read = %ld\n", + len/sizeof(struct inotify_event)); + } + close(fd); + return 0; +}
linux-kselftest-mirror@lists.linaro.org