The series consists of two parts: - pids.events rework (originally v2, patches 1-6, - migration charging, patches 7-9.
The changes are independent in principle, I stacked them for (my) convenience and because they both deserve RFC:
1) Changed semantics of v2 pids.events - similar change was proposed for memory.swap.events:max [1] 2) Migration charging is obsolete concept
How are the new events supposed to be useful?
- pids.events.local:max - tells that cgroup's limit is hit (too tight?) - pids.events.local:max.imposed - tells that cgroup's workload was restricted (generalization of 'cgroup: fork rejected by pids controller in %s' message) - pids.events:* - "only" directs top-down search to cgroups of interest
The migration charging is motivated by apparenty surprising pids.current > pids.max because supervised processes are forked in supervisor's cgroup (more details in commit cgroup/pids: Enforce pids.max on task migrations too)
Changes from v2 (https://lore.kernel.org/r/20200205134426.10570-1-mkoutny@suse.com) - implemented pids.events.local (Tejun) - added migration charging
[1] https://lore.kernel.org/r/20230202155626.1829121-1-hannes@cmpxchg.org/
Michal Koutný (9): cgroup/pids: Remove superfluous zeroing cgroup/pids: Separate semantics of pids.events related to pids.max cgroup/pids: Make event counters hierarchical cgroup/pids: Add pids.events.local selftests: cgroup: Lexicographic order in Makefile selftests: cgroup: Add basic tests for pids controller cgroup/pids: Replace uncharge/charge pair with a single function cgroup/pids: Enforce pids.max on task migrations selftests: cgroup: Add tests pids controller
Documentation/admin-guide/cgroup-v1/pids.rst | 3 +- Documentation/admin-guide/cgroup-v2.rst | 22 +- include/linux/cgroup-defs.h | 7 +- kernel/cgroup/cgroup.c | 16 +- kernel/cgroup/pids.c | 206 +++++++++---- tools/testing/selftests/cgroup/Makefile | 25 +- tools/testing/selftests/cgroup/test_pids.c | 302 +++++++++++++++++++ 7 files changed, 514 insertions(+), 67 deletions(-) create mode 100644 tools/testing/selftests/cgroup/test_pids.c
base-commit: 026e680b0a08a62b1d948e5a8ca78700bfac0e6e
Atomic counters are in kzalloc'd struct. They are zeroed already and atomic64_t does not need special initialization (cf kernel/trace/trace_clock.c:trace_counter).
Signed-off-by: Michal Koutný mkoutny@suse.com --- kernel/cgroup/pids.c | 2 -- 1 file changed, 2 deletions(-)
diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 7695e60bcb40..0e5ec7d59b4d 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -75,9 +75,7 @@ pids_css_alloc(struct cgroup_subsys_state *parent) if (!pids) return ERR_PTR(-ENOMEM);
- atomic64_set(&pids->counter, 0); atomic64_set(&pids->limit, PIDS_MAX); - atomic64_set(&pids->events_limit, 0); return &pids->css; }
Currently, when pids.max limit is breached in the hierarchy, the event is counted and reported in the cgroup where the forking task resides.
This decouples the limit and the notification caused by the limit making it hard to detect when the actual limit was effected.
Let's introduce new events: max The number of times the limit of the cgroup was hit.
max.imposed The number of times fork failed in the cgroup because of self or ancestor limit.
Since it changes semantics of the original "max" event, we introduce this change only in the v2 API of the controller.
Signed-off-by: Michal Koutný mkoutny@suse.com --- Documentation/admin-guide/cgroup-v1/pids.rst | 3 +- Documentation/admin-guide/cgroup-v2.rst | 12 ++++ kernel/cgroup/pids.c | 71 ++++++++++++++++---- 3 files changed, 73 insertions(+), 13 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v1/pids.rst b/Documentation/admin-guide/cgroup-v1/pids.rst index 6acebd9e72c8..0f9f9a7b1f6c 100644 --- a/Documentation/admin-guide/cgroup-v1/pids.rst +++ b/Documentation/admin-guide/cgroup-v1/pids.rst @@ -36,7 +36,8 @@ superset of parent/child/pids.current.
The pids.events file contains event counters:
- - max: Number of times fork failed because limit was hit. + - max: Number of times fork failed in the cgroup because limit was hit in + self or ancestors.
Example ------- diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 17e6e9565156..4f04538d688c 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -2186,6 +2186,18 @@ PID Interface Files The number of processes currently in the cgroup and its descendants.
+ pids.events + A read-only flat-keyed file which exists on non-root cgroups. Unless + specified otherwise, a value change in this file generates a file modified + event. The following entries are defined. + + max + The number of times the limit of the cgroup was hit. + + max.imposed + The number of times fork failed in the cgroup because of self + or ancestor limit. + Organisational operations are not blocked by cgroup policies, so it is possible to have pids.current > pids.max. This can be done by either setting the limit to be smaller than pids.current, or attaching enough diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 0e5ec7d59b4d..471562609eef 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -38,6 +38,14 @@ #define PIDS_MAX (PID_MAX_LIMIT + 1ULL) #define PIDS_MAX_STR "max"
+enum pidcg_event { + /* Fork failed in subtree because this pids_cgroup limit was hit. */ + PIDCG_MAX, + /* Fork failed in this pids_cgroup because ancestor limit was hit. */ + PIDCG_MAX_IMPOSED, + NR_PIDCG_EVENTS, +}; + struct pids_cgroup { struct cgroup_subsys_state css;
@@ -52,8 +60,7 @@ struct pids_cgroup { /* Handle for "pids.events" */ struct cgroup_file events_file;
- /* Number of times fork failed because limit was hit. */ - atomic64_t events_limit; + atomic64_t events[NR_PIDCG_EVENTS]; };
static struct pids_cgroup *css_pids(struct cgroup_subsys_state *css) @@ -148,12 +155,13 @@ static void pids_charge(struct pids_cgroup *pids, int num) * pids_try_charge - hierarchically try to charge the pid count * @pids: the pid cgroup state * @num: the number of pids to charge + * @fail: storage of pid cgroup causing the fail * * This function follows the set limit. It will fail if the charge would cause * the new value to exceed the hierarchical limit. Returns 0 if the charge * succeeded, otherwise -EAGAIN. */ -static int pids_try_charge(struct pids_cgroup *pids, int num) +static int pids_try_charge(struct pids_cgroup *pids, int num, struct pids_cgroup **fail) { struct pids_cgroup *p, *q;
@@ -166,9 +174,10 @@ static int pids_try_charge(struct pids_cgroup *pids, int num) * p->limit is %PIDS_MAX then we know that this test will never * fail. */ - if (new > limit) + if (new > limit) { + *fail = p; goto revert; - + } /* * Not technically accurate if we go over limit somewhere up * the hierarchy, but that's tolerable for the watermark. @@ -236,7 +245,7 @@ static void pids_cancel_attach(struct cgroup_taskset *tset) static int pids_can_fork(struct task_struct *task, struct css_set *cset) { struct cgroup_subsys_state *css; - struct pids_cgroup *pids; + struct pids_cgroup *pids, *pids_over_limit; int err;
if (cset) @@ -244,15 +253,23 @@ static int pids_can_fork(struct task_struct *task, struct css_set *cset) else css = task_css_check(current, pids_cgrp_id, true); pids = css_pids(css); - err = pids_try_charge(pids, 1); + err = pids_try_charge(pids, 1, &pids_over_limit); if (err) { - /* Only log the first time events_limit is incremented. */ - if (atomic64_inc_return(&pids->events_limit) == 1) { + /* compatibility on v1 where events were notified in leaves. */ + if (!cgroup_subsys_on_dfl(pids_cgrp_subsys)) + pids_over_limit = pids; + + /* Only log the first time limit is hit. */ + if (atomic64_inc_return(&pids->events[PIDCG_MAX_IMPOSED]) == 1) { pr_info("cgroup: fork rejected by pids controller in "); - pr_cont_cgroup_path(css->cgroup); + pr_cont_cgroup_path(pids->css.cgroup); pr_cont("\n"); } + atomic64_inc(&pids_over_limit->events[PIDCG_MAX]); + cgroup_file_notify(&pids->events_file); + if (pids_over_limit != pids) + cgroup_file_notify(&pids_over_limit->events_file); } return err; } @@ -341,7 +358,16 @@ static int pids_events_show(struct seq_file *sf, void *v) { struct pids_cgroup *pids = css_pids(seq_css(sf));
- seq_printf(sf, "max %lld\n", (s64)atomic64_read(&pids->events_limit)); + seq_printf(sf, "max %lld\n", (s64)atomic64_read(&pids->events[PIDCG_MAX])); + seq_printf(sf, "max.imposed %lld\n", (s64)atomic64_read(&pids->events[PIDCG_MAX_IMPOSED])); + return 0; +} + +static int pids_events_show_legacy(struct seq_file *sf, void *v) +{ + struct pids_cgroup *pids = css_pids(seq_css(sf)); + + seq_printf(sf, "max%lld\n", (s64)atomic64_read(&pids->events[PIDCG_MAX_IMPOSED])); return 0; }
@@ -371,6 +397,27 @@ static struct cftype pids_files[] = { { } /* terminate */ };
+static struct cftype pids_files_legacy[] = { + { + .name = "max", + .write = pids_max_write, + .seq_show = pids_max_show, + .flags = CFTYPE_NOT_ON_ROOT, + }, + { + .name = "current", + .read_s64 = pids_current_read, + .flags = CFTYPE_NOT_ON_ROOT, + }, + { + .name = "events", + .seq_show = pids_events_show_legacy, + .file_offset = offsetof(struct pids_cgroup, events_file), + .flags = CFTYPE_NOT_ON_ROOT, + }, + { } /* terminate */ +}; + struct cgroup_subsys pids_cgrp_subsys = { .css_alloc = pids_css_alloc, .css_free = pids_css_free, @@ -379,7 +426,7 @@ struct cgroup_subsys pids_cgrp_subsys = { .can_fork = pids_can_fork, .cancel_fork = pids_cancel_fork, .release = pids_release, - .legacy_cftypes = pids_files, .dfl_cftypes = pids_files, + .legacy_cftypes = pids_files_legacy, .threaded = true, };
Hello,
On Fri, Apr 05, 2024 at 07:05:41PM +0200, Michal Koutný wrote:
Currently, when pids.max limit is breached in the hierarchy, the event is counted and reported in the cgroup where the forking task resides.
This decouples the limit and the notification caused by the limit making it hard to detect when the actual limit was effected.
Let's introduce new events: max The number of times the limit of the cgroup was hit.
max.imposed The number of times fork failed in the cgroup because of self or ancestor limit.
The whole series make sense to me. I'm not sure about max.imposed field name. Maybe a name which clearly signfies rejection of forks would be clearer? Johannes, what do you think?
Thanks.
On Mon, Apr 08, 2024 at 07:55:38AM -1000, Tejun Heo wrote:
Hello,
On Fri, Apr 05, 2024 at 07:05:41PM +0200, Michal Koutný wrote:
Currently, when pids.max limit is breached in the hierarchy, the event is counted and reported in the cgroup where the forking task resides.
This decouples the limit and the notification caused by the limit making it hard to detect when the actual limit was effected.
Let's introduce new events: max The number of times the limit of the cgroup was hit.
max.imposed The number of times fork failed in the cgroup because of self or ancestor limit.
The whole series make sense to me. I'm not sure about max.imposed field name. Maybe a name which clearly signfies rejection of forks would be clearer? Johannes, what do you think?
The max event at the level where the limit is set (and up, for hierarchical accounting) makes sense to me.
max.imposed is conceptually not entirely unprecedented, but something we've tried to avoid. Usually the idea is that events correspond to specific cgroup limitations at that level. Failures due to constraints higher up could be from anything, including system-level shortages.
IOW, events are supposed to be more about "how many times did this limit here trigger", and less about "how many times did something happen to the tasks local to this group".
It's a bit arbitrary and not perfectly followed everywhere, but I think there is value in trying to maintain that distinction, so that somebody looking at those files doesn't have to rack their brains or look up every counter in the docs to figure out what it's tracking.
It's at least true for the misc controller, and for most of memcg - with the weird exception of the swap.max events which we've tried to fix before...
For "things that are happening to the tasks in this group", would it make more sense to have an e.g. pids.stat::forkfail instead?
(Or just not have that event at all? I'm not sure if it's actually needed or whether you kept it only to maintain some form of the information that is currently provided by the pr_info()).
On Mon, Apr 08, 2024 at 07:55:38AM -1000, Tejun Heo tj@kernel.org wrote:
The whole series make sense to me.
Including the migration charging? (Asking whether I should keep it stacked in v4 posting.)
Thanks, Michal
On Fri, Apr 12, 2024 at 04:23:24PM +0200, Michal Koutný wrote:
On Mon, Apr 08, 2024 at 07:55:38AM -1000, Tejun Heo tj@kernel.org wrote:
The whole series make sense to me.
Including the migration charging? (Asking whether I should keep it stacked in v4 posting.)
Oh, let's separate that part out. I'm not sure about that. The problem with can_attach failures is that they're really opaque and the more we do it the less we'll be able to tell where the failures are coming from, so I'm not very enthusiastic about them.
Thanks.
The pids.events file should honor the hierarchy, so make the events propagate from their origin up to the root on the unified hierarchy. The v1 behavior remains non-hierarchical.
Signed-off-by: Michal Koutný mkoutny@suse.com --- Documentation/admin-guide/cgroup-v2.rst | 4 ++- kernel/cgroup/pids.c | 46 ++++++++++++++++--------- 2 files changed, 33 insertions(+), 17 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 4f04538d688c..5d4c505cae06 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -2189,7 +2189,9 @@ PID Interface Files pids.events A read-only flat-keyed file which exists on non-root cgroups. Unless specified otherwise, a value change in this file generates a file modified - event. The following entries are defined. + event. Fields in this file are hierarchical and the file modified event + can be generated due to an event down the hierarchy. The following + entries are defined.
max The number of times the limit of the cgroup was hit. diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 471562609eef..76c0a97e42da 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -238,6 +238,34 @@ static void pids_cancel_attach(struct cgroup_taskset *tset) } }
+static void pids_event(struct pids_cgroup *pids_forking, + struct pids_cgroup *pids_over_limit) +{ + struct pids_cgroup *p = pids_forking; + bool limit = false; + + /* Only log the first time limit is hit. */ + if (atomic64_inc_return(&p->events[PIDCG_MAX_IMPOSED]) == 1) { + pr_info("cgroup: fork rejected by pids controller in "); + pr_cont_cgroup_path(p->css.cgroup); + pr_cont("\n"); + } + /* Events are only notified in pids_forking on v1 */ + if (!cgroup_subsys_on_dfl(pids_cgrp_subsys)) + return; + + for (; parent_pids(p); p = parent_pids(p)) { + atomic64_inc(&p->events[PIDCG_MAX_IMPOSED]); + + if (p == pids_over_limit) + limit = true; + if (limit) + atomic64_inc(&p->events[PIDCG_MAX]); + + cgroup_file_notify(&p->events_file); + } +} + /* * task_css_check(true) in pids_can_fork() and pids_cancel_fork() relies * on cgroup_threadgroup_change_begin() held by the copy_process(). @@ -254,23 +282,9 @@ static int pids_can_fork(struct task_struct *task, struct css_set *cset) css = task_css_check(current, pids_cgrp_id, true); pids = css_pids(css); err = pids_try_charge(pids, 1, &pids_over_limit); - if (err) { - /* compatibility on v1 where events were notified in leaves. */ - if (!cgroup_subsys_on_dfl(pids_cgrp_subsys)) - pids_over_limit = pids; - - /* Only log the first time limit is hit. */ - if (atomic64_inc_return(&pids->events[PIDCG_MAX_IMPOSED]) == 1) { - pr_info("cgroup: fork rejected by pids controller in "); - pr_cont_cgroup_path(pids->css.cgroup); - pr_cont("\n"); - } - atomic64_inc(&pids_over_limit->events[PIDCG_MAX]); + if (err) + pids_event(pids, pids_over_limit);
- cgroup_file_notify(&pids->events_file); - if (pids_over_limit != pids) - cgroup_file_notify(&pids_over_limit->events_file); - } return err; }
Hierarchical counting of events is not practical for watching when a particular pids.max is being hit. Therefore introduce .local flavor of events file (akin to memory controller) that collects only events relevant to given cgroup.
Signed-off-by: Michal Koutný mkoutny@suse.com --- kernel/cgroup/pids.c | 41 +++++++++++++++++++++++++++++++++-------- 1 file changed, 33 insertions(+), 8 deletions(-)
diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 76c0a97e42da..f5f81274658e 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -57,10 +57,12 @@ struct pids_cgroup { atomic64_t limit; int64_t watermark;
- /* Handle for "pids.events" */ + /* Handles for pids.events[.local] */ struct cgroup_file events_file; + struct cgroup_file events_local_file;
atomic64_t events[NR_PIDCG_EVENTS]; + atomic64_t events_local[NR_PIDCG_EVENTS]; };
static struct pids_cgroup *css_pids(struct cgroup_subsys_state *css) @@ -245,11 +247,12 @@ static void pids_event(struct pids_cgroup *pids_forking, bool limit = false;
/* Only log the first time limit is hit. */ - if (atomic64_inc_return(&p->events[PIDCG_MAX_IMPOSED]) == 1) { + if (atomic64_inc_return(&p->events_local[PIDCG_MAX_IMPOSED]) == 1) { pr_info("cgroup: fork rejected by pids controller in "); pr_cont_cgroup_path(p->css.cgroup); pr_cont("\n"); } + cgroup_file_notify(&p->events_local_file); /* Events are only notified in pids_forking on v1 */ if (!cgroup_subsys_on_dfl(pids_cgrp_subsys)) return; @@ -257,8 +260,11 @@ static void pids_event(struct pids_cgroup *pids_forking, for (; parent_pids(p); p = parent_pids(p)) { atomic64_inc(&p->events[PIDCG_MAX_IMPOSED]);
- if (p == pids_over_limit) + if (p == pids_over_limit) { limit = true; + atomic64_inc(&p->events_local[PIDCG_MAX]); + cgroup_file_notify(&p->events_local_file); + } if (limit) atomic64_inc(&p->events[PIDCG_MAX]);
@@ -368,12 +374,25 @@ static s64 pids_peak_read(struct cgroup_subsys_state *css, return READ_ONCE(pids->watermark); }
-static int pids_events_show(struct seq_file *sf, void *v) +static int __pids_events_show(struct seq_file *sf, bool local) { struct pids_cgroup *pids = css_pids(seq_css(sf)); + atomic64_t *events = local ? pids->events_local : pids->events; + + seq_printf(sf, "max %lld\n", (s64)atomic64_read(&events[PIDCG_MAX])); + seq_printf(sf, "max.imposed %lld\n", (s64)atomic64_read(&events[PIDCG_MAX_IMPOSED])); + return 0; +} + +static int pids_events_show(struct seq_file *sf, void *v) +{ + __pids_events_show(sf, false); + return 0; +}
- seq_printf(sf, "max %lld\n", (s64)atomic64_read(&pids->events[PIDCG_MAX])); - seq_printf(sf, "max.imposed %lld\n", (s64)atomic64_read(&pids->events[PIDCG_MAX_IMPOSED])); +static int pids_events_local_show(struct seq_file *sf, void *v) +{ + __pids_events_show(sf, true); return 0; }
@@ -381,7 +400,7 @@ static int pids_events_show_legacy(struct seq_file *sf, void *v) { struct pids_cgroup *pids = css_pids(seq_css(sf));
- seq_printf(sf, "max%lld\n", (s64)atomic64_read(&pids->events[PIDCG_MAX_IMPOSED])); + seq_printf(sf, "max%lld\n", (s64)atomic64_read(&pids->events_local[PIDCG_MAX_IMPOSED])); return 0; }
@@ -408,6 +427,12 @@ static struct cftype pids_files[] = { .file_offset = offsetof(struct pids_cgroup, events_file), .flags = CFTYPE_NOT_ON_ROOT, }, + { + .name = "events.local", + .seq_show = pids_events_local_show, + .file_offset = offsetof(struct pids_cgroup, events_file), + .flags = CFTYPE_NOT_ON_ROOT, + }, { } /* terminate */ };
@@ -426,7 +451,7 @@ static struct cftype pids_files_legacy[] = { { .name = "events", .seq_show = pids_events_show_legacy, - .file_offset = offsetof(struct pids_cgroup, events_file), + .file_offset = offsetof(struct pids_cgroup, events_local_file), .flags = CFTYPE_NOT_ON_ROOT, }, { } /* terminate */
This will reduce number of conflicts when modifying the lists.
Signed-off-by: Michal Koutný mkoutny@suse.com --- tools/testing/selftests/cgroup/Makefile | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile index 00b441928909..f3e1ef69e88d 100644 --- a/tools/testing/selftests/cgroup/Makefile +++ b/tools/testing/selftests/cgroup/Makefile @@ -6,26 +6,27 @@ all: ${HELPER_PROGS} TEST_FILES := with_stress.sh TEST_PROGS := test_stress.sh test_cpuset_prs.sh TEST_GEN_FILES := wait_inotify -TEST_GEN_PROGS = test_memcontrol -TEST_GEN_PROGS += test_kmem -TEST_GEN_PROGS += test_core -TEST_GEN_PROGS += test_freezer -TEST_GEN_PROGS += test_kill +# Keep the lists lexicographically sorted +TEST_GEN_PROGS = test_core TEST_GEN_PROGS += test_cpu TEST_GEN_PROGS += test_cpuset -TEST_GEN_PROGS += test_zswap +TEST_GEN_PROGS += test_freezer TEST_GEN_PROGS += test_hugetlb_memcg +TEST_GEN_PROGS += test_kill +TEST_GEN_PROGS += test_kmem +TEST_GEN_PROGS += test_memcontrol +TEST_GEN_PROGS += test_zswap
LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h
include ../lib.mk
-$(OUTPUT)/test_memcontrol: cgroup_util.c -$(OUTPUT)/test_kmem: cgroup_util.c $(OUTPUT)/test_core: cgroup_util.c -$(OUTPUT)/test_freezer: cgroup_util.c -$(OUTPUT)/test_kill: cgroup_util.c $(OUTPUT)/test_cpu: cgroup_util.c $(OUTPUT)/test_cpuset: cgroup_util.c -$(OUTPUT)/test_zswap: cgroup_util.c +$(OUTPUT)/test_freezer: cgroup_util.c $(OUTPUT)/test_hugetlb_memcg: cgroup_util.c +$(OUTPUT)/test_kill: cgroup_util.c +$(OUTPUT)/test_kmem: cgroup_util.c +$(OUTPUT)/test_memcontrol: cgroup_util.c +$(OUTPUT)/test_zswap: cgroup_util.c
This commit adds (and wires in) new test program for checking basic pids controller functionality -- restricting tasks in a cgroup and correct event counting.
Signed-off-by: Michal Koutný mkoutny@suse.com --- tools/testing/selftests/cgroup/Makefile | 2 + tools/testing/selftests/cgroup/test_pids.c | 187 +++++++++++++++++++++ 2 files changed, 189 insertions(+) create mode 100644 tools/testing/selftests/cgroup/test_pids.c
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile index f3e1ef69e88d..f5f0886a2c4a 100644 --- a/tools/testing/selftests/cgroup/Makefile +++ b/tools/testing/selftests/cgroup/Makefile @@ -15,6 +15,7 @@ TEST_GEN_PROGS += test_hugetlb_memcg TEST_GEN_PROGS += test_kill TEST_GEN_PROGS += test_kmem TEST_GEN_PROGS += test_memcontrol +TEST_GEN_PROGS += test_pids TEST_GEN_PROGS += test_zswap
LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h @@ -29,4 +30,5 @@ $(OUTPUT)/test_hugetlb_memcg: cgroup_util.c $(OUTPUT)/test_kill: cgroup_util.c $(OUTPUT)/test_kmem: cgroup_util.c $(OUTPUT)/test_memcontrol: cgroup_util.c +$(OUTPUT)/test_pids: cgroup_util.c $(OUTPUT)/test_zswap: cgroup_util.c diff --git a/tools/testing/selftests/cgroup/test_pids.c b/tools/testing/selftests/cgroup/test_pids.c new file mode 100644 index 000000000000..c1c3a3965624 --- /dev/null +++ b/tools/testing/selftests/cgroup/test_pids.c @@ -0,0 +1,187 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE + +#include <errno.h> +#include <linux/limits.h> +#include <signal.h> +#include <string.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <unistd.h> + +#include "../kselftest.h" +#include "cgroup_util.h" + +static int run_success(const char *cgroup, void *arg) +{ + return 0; +} + +static int run_pause(const char *cgroup, void *arg) +{ + return pause(); +} + +/* + * This test checks that pids.max prevents forking new children above the + * specified limit in the cgroup. + */ +static int test_pids_max(const char *root) +{ + int ret = KSFT_FAIL; + char *cg_pids; + int pid; + + + cg_pids = cg_name(root, "pids_test"); + if (!cg_pids) + goto cleanup; + + if (cg_create(cg_pids)) + goto cleanup; + + if (cg_read_strcmp(cg_pids, "pids.max", "max\n")) + goto cleanup; + + if (cg_write(cg_pids, "pids.max", "2")) + goto cleanup; + + if (cg_enter_current(cg_pids)) + goto cleanup; + + pid = cg_run_nowait(cg_pids, run_pause, NULL); + if (pid < 0) + goto cleanup; + + if (cg_run_nowait(cg_pids, run_success, NULL) != -1 || errno != EAGAIN) + goto cleanup; + + if (kill(pid, SIGINT)) + goto cleanup; + + ret = KSFT_PASS; + +cleanup: + cg_enter_current(root); + cg_destroy(cg_pids); + free(cg_pids); + + return ret; +} + +/* + * This test checks that pids.max prevents forking new children above the + * specified limit in the cgroup. + */ +static int test_pids_events(const char *root) +{ + int ret = KSFT_FAIL; + char *cg_parent = NULL, *cg_child = NULL; + int pid; + + + cg_parent = cg_name(root, "pids_parent"); + cg_child = cg_name(cg_parent, "pids_child"); + if (!cg_parent || !cg_child) + goto cleanup; + + if (cg_create(cg_parent)) + goto cleanup; + if (cg_write(cg_parent, "cgroup.subtree_control", "+pids")) + goto cleanup; + if (cg_create(cg_child)) + goto cleanup; + + if (cg_write(cg_parent, "pids.max", "2")) + goto cleanup; + + if (cg_read_strcmp(cg_child, "pids.max", "max\n")) + goto cleanup; + + if (cg_enter_current(cg_child)) + goto cleanup; + + pid = cg_run_nowait(cg_child, run_pause, NULL); + if (pid < 0) + goto cleanup; + + if (cg_run_nowait(cg_child, run_success, NULL) != -1 || errno != EAGAIN) + goto cleanup; + + if (kill(pid, SIGINT)) + goto cleanup; + + + if (cg_read_key_long(cg_child, "pids.events", "max ") != 0) + goto cleanup; + if (cg_read_key_long(cg_child, "pids.events", "max.imposed ") != 1) + goto cleanup; + + if (cg_read_key_long(cg_parent, "pids.events", "max ") != 1) + goto cleanup; + if (cg_read_key_long(cg_parent, "pids.events", "max.imposed ") != 1) + goto cleanup; + + + ret = KSFT_PASS; + +cleanup: + cg_enter_current(root); + if (cg_child) + cg_destroy(cg_child); + if (cg_parent) + cg_destroy(cg_parent); + free(cg_child); + free(cg_parent); + + return ret; +} + + + +#define T(x) { x, #x } +struct pids_test { + int (*fn)(const char *root); + const char *name; +} tests[] = { + T(test_pids_max), + T(test_pids_events), +}; +#undef T + +int main(int argc, char **argv) +{ + char root[PATH_MAX]; + int i, ret = EXIT_SUCCESS; + + if (cg_find_unified_root(root, sizeof(root))) + ksft_exit_skip("cgroup v2 isn't mounted\n"); + + /* + * Check that pids controller is available: + * pids is listed in cgroup.controllers + */ + if (cg_read_strstr(root, "cgroup.controllers", "pids")) + ksft_exit_skip("pids controller isn't available\n"); + + if (cg_read_strstr(root, "cgroup.subtree_control", "pids")) + if (cg_write(root, "cgroup.subtree_control", "+pids")) + ksft_exit_skip("Failed to set pids controller\n"); + + for (i = 0; i < ARRAY_SIZE(tests); i++) { + switch (tests[i].fn(root)) { + case KSFT_PASS: + ksft_test_result_pass("%s\n", tests[i].name); + break; + case KSFT_SKIP: + ksft_test_result_skip("%s\n", tests[i].name); + break; + default: + ret = EXIT_FAILURE; + ksft_test_result_fail("%s\n", tests[i].name); + break; + } + } + + return ret; +}
On 4/5/24 10:05 PM, Michal Koutný wrote:
This commit adds (and wires in) new test program for checking basic pids controller functionality -- restricting tasks in a cgroup and correct event counting.
Signed-off-by: Michal Koutný mkoutny@suse.com
tools/testing/selftests/cgroup/Makefile | 2 + tools/testing/selftests/cgroup/test_pids.c | 187 +++++++++++++++++++++
Please create/add test_pid to .gitignore file.
2 files changed, 189 insertions(+) create mode 100644 tools/testing/selftests/cgroup/test_pids.c
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile index f3e1ef69e88d..f5f0886a2c4a 100644 --- a/tools/testing/selftests/cgroup/Makefile +++ b/tools/testing/selftests/cgroup/Makefile @@ -15,6 +15,7 @@ TEST_GEN_PROGS += test_hugetlb_memcg TEST_GEN_PROGS += test_kill TEST_GEN_PROGS += test_kmem TEST_GEN_PROGS += test_memcontrol +TEST_GEN_PROGS += test_pids TEST_GEN_PROGS += test_zswap LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h @@ -29,4 +30,5 @@ $(OUTPUT)/test_hugetlb_memcg: cgroup_util.c $(OUTPUT)/test_kill: cgroup_util.c $(OUTPUT)/test_kmem: cgroup_util.c $(OUTPUT)/test_memcontrol: cgroup_util.c +$(OUTPUT)/test_pids: cgroup_util.c $(OUTPUT)/test_zswap: cgroup_util.c diff --git a/tools/testing/selftests/cgroup/test_pids.c b/tools/testing/selftests/cgroup/test_pids.c new file mode 100644 index 000000000000..c1c3a3965624 --- /dev/null +++ b/tools/testing/selftests/cgroup/test_pids.c @@ -0,0 +1,187 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE
+#include <errno.h> +#include <linux/limits.h> +#include <signal.h> +#include <string.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <unistd.h>
+#include "../kselftest.h" +#include "cgroup_util.h"
+static int run_success(const char *cgroup, void *arg) +{
- return 0;
+}
+static int run_pause(const char *cgroup, void *arg) +{
- return pause();
+}
+/*
- This test checks that pids.max prevents forking new children above the
- specified limit in the cgroup.
- */
+static int test_pids_max(const char *root) +{
- int ret = KSFT_FAIL;
- char *cg_pids;
- int pid;
Please remove extra line.
- cg_pids = cg_name(root, "pids_test");
- if (!cg_pids)
goto cleanup;
- if (cg_create(cg_pids))
goto cleanup;
- if (cg_read_strcmp(cg_pids, "pids.max", "max\n"))
goto cleanup;
- if (cg_write(cg_pids, "pids.max", "2"))
goto cleanup;
- if (cg_enter_current(cg_pids))
goto cleanup;
- pid = cg_run_nowait(cg_pids, run_pause, NULL);
- if (pid < 0)
goto cleanup;
- if (cg_run_nowait(cg_pids, run_success, NULL) != -1 || errno != EAGAIN)
goto cleanup;
- if (kill(pid, SIGINT))
goto cleanup;
- ret = KSFT_PASS;
+cleanup:
- cg_enter_current(root);
- cg_destroy(cg_pids);
- free(cg_pids);
- return ret;
+}
+/*
- This test checks that pids.max prevents forking new children above the
- specified limit in the cgroup.
- */
+static int test_pids_events(const char *root) +{
- int ret = KSFT_FAIL;
- char *cg_parent = NULL, *cg_child = NULL;
- int pid;
- cg_parent = cg_name(root, "pids_parent");
- cg_child = cg_name(cg_parent, "pids_child");
- if (!cg_parent || !cg_child)
goto cleanup;
- if (cg_create(cg_parent))
goto cleanup;
- if (cg_write(cg_parent, "cgroup.subtree_control", "+pids"))
goto cleanup;
- if (cg_create(cg_child))
goto cleanup;
- if (cg_write(cg_parent, "pids.max", "2"))
goto cleanup;
- if (cg_read_strcmp(cg_child, "pids.max", "max\n"))
goto cleanup;
- if (cg_enter_current(cg_child))
goto cleanup;
- pid = cg_run_nowait(cg_child, run_pause, NULL);
- if (pid < 0)
goto cleanup;
- if (cg_run_nowait(cg_child, run_success, NULL) != -1 || errno != EAGAIN)
goto cleanup;
- if (kill(pid, SIGINT))
goto cleanup;
Remove extra line.
- if (cg_read_key_long(cg_child, "pids.events", "max ") != 0)
goto cleanup;
- if (cg_read_key_long(cg_child, "pids.events", "max.imposed ") != 1)
goto cleanup;
- if (cg_read_key_long(cg_parent, "pids.events", "max ") != 1)
goto cleanup;
- if (cg_read_key_long(cg_parent, "pids.events", "max.imposed ") != 1)
goto cleanup;
- ret = KSFT_PASS;
+cleanup:
- cg_enter_current(root);
- if (cg_child)
cg_destroy(cg_child);
- if (cg_parent)
cg_destroy(cg_parent);
- free(cg_child);
- free(cg_parent);
- return ret;
+}
+#define T(x) { x, #x } +struct pids_test {
- int (*fn)(const char *root);
- const char *name;
+} tests[] = {
- T(test_pids_max),
- T(test_pids_events),
+}; +#undef T
+int main(int argc, char **argv) +{
- char root[PATH_MAX];
- int i, ret = EXIT_SUCCESS;
The ksft_print_header(); ksft_set_plan(total_number_of_tests); are missing. Please use all of the ksft APIs to make the test TAP compliant.
- if (cg_find_unified_root(root, sizeof(root)))
ksft_exit_skip("cgroup v2 isn't mounted\n");
- /*
* Check that pids controller is available:
* pids is listed in cgroup.controllers
*/
- if (cg_read_strstr(root, "cgroup.controllers", "pids"))
ksft_exit_skip("pids controller isn't available\n");
- if (cg_read_strstr(root, "cgroup.subtree_control", "pids"))
if (cg_write(root, "cgroup.subtree_control", "+pids"))
ksft_exit_skip("Failed to set pids controller\n");
- for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
ksft_test_result_pass("%s\n", tests[i].name);
break;
case KSFT_SKIP:
ksft_test_result_skip("%s\n", tests[i].name);
break;
default:
ret = EXIT_FAILURE;
ksft_test_result_fail("%s\n", tests[i].name);
break;
Use ksft_test_result_report() instead of swith-case here.
}
- }
- return ret;
+}
On Sun, Apr 07, 2024 at 02:37:44AM +0500, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
The ksft_print_header(); ksft_set_plan(total_number_of_tests); are missing. Please use all of the ksft APIs to make the test TAP compliant.
Will do.
- for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
ksft_test_result_pass("%s\n", tests[i].name);
break;
case KSFT_SKIP:
ksft_test_result_skip("%s\n", tests[i].name);
break;
default:
ret = EXIT_FAILURE;
ksft_test_result_fail("%s\n", tests[i].name);
break;
Use ksft_test_result_report() instead of swith-case here.
Do you mean ksft_test_result()? That one cannot distinguish the KSFT_SKIP case. Or ksft_test_result_code(tests[i].fn(root), tests[i].name)?
Would the existing ksft_test_resul_*() calls inside switch-case still TAP-work?
Thanks, Michal
On 4/8/24 4:29 PM, Michal Koutný wrote:
On Sun, Apr 07, 2024 at 02:37:44AM +0500, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
The ksft_print_header(); ksft_set_plan(total_number_of_tests); are missing. Please use all of the ksft APIs to make the test TAP compliant.
Will do.
- for (i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
ksft_test_result_pass("%s\n", tests[i].name);
break;
case KSFT_SKIP:
ksft_test_result_skip("%s\n", tests[i].name);
break;
default:
ret = EXIT_FAILURE;
ksft_test_result_fail("%s\n", tests[i].name);
break;
Use ksft_test_result_report() instead of swith-case here.
Do you mean ksft_test_result()? That one cannot distinguish the KSFT_SKIP case. Or ksft_test_result_code(tests[i].fn(root), tests[i].name)?
No, this doesn't seem useful here.
Would the existing ksft_test_resul_*() calls inside switch-case still TAP-work?
This part of your switch-case are correct. It just that by using ksft_test_result_report you can achieve the same thing. It has has SKIP support.
ksft_test_result_report(tests[i].fn(root), tests[i].name)
Thanks, Michal
On Mon, Apr 08, 2024 at 04:53:11PM +0500, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
ksft_test_result_report(tests[i].fn(root), tests[i].name)
$ git grep ksft_test_result_report v6.9-rc3 -- (empty result)
I can't find that helper. Is that in some devel repositories?
Michal
On 4/8/24 5:01 PM, Michal Koutný wrote:
On Mon, Apr 08, 2024 at 04:53:11PM +0500, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
ksft_test_result_report(tests[i].fn(root), tests[i].name)
$ git grep ksft_test_result_report v6.9-rc3 -- (empty result)
I can't find that helper. Is that in some devel repositories?
Sorry, I always do development on next. So it has been added recently. Try searching it on next:
git grep ksft_test_result_report next-20240404 --
Michal
On 4/8/24 08:04, Muhammad Usama Anjum wrote:
On 4/8/24 5:01 PM, Michal Koutný wrote:
On Mon, Apr 08, 2024 at 04:53:11PM +0500, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
ksft_test_result_report(tests[i].fn(root), tests[i].name)
$ git grep ksft_test_result_report v6.9-rc3 -- (empty result)
I can't find that helper. Is that in some devel repositories?
Sorry, I always do development on next. So it has been added recently. Try searching it on next:
git grep ksft_test_result_report next-20240404 --
I don't believe it is a good idea to make this patch having a dependency on another set of patches in -next because the test won't run in a non-next environment. We can always have additional patches later on to modify the tests to use the newly available APIs.
Cheers, Longman
Michal
On 4/9/24 5:12 AM, Waiman Long wrote:
On 4/8/24 08:04, Muhammad Usama Anjum wrote:
On 4/8/24 5:01 PM, Michal Koutný wrote:
On Mon, Apr 08, 2024 at 04:53:11PM +0500, Muhammad Usama Anjum usama.anjum@collabora.com wrote:
ksft_test_result_report(tests[i].fn(root), tests[i].name)
$ git grep ksft_test_result_report v6.9-rc3 -- (empty result)
I can't find that helper. Is that in some devel repositories?
Sorry, I always do development on next. So it has been added recently. Try searching it on next:
git grep ksft_test_result_report next-20240404 --
I don't believe it is a good idea to make this patch having a dependency on another set of patches in -next because the test won't run in a non-next environment. We can always have additional patches later on to modify the tests to use the newly available APIs.
Sure, it is okay with me.
Cheers, Longman
Michal
No functional change intended. This rework reduces modifications of pids counters only to a minimal subtree of uncharged/charged cgroups.
Signed-off-by: Michal Koutný mkoutny@suse.com --- kernel/cgroup/pids.c | 80 ++++++++++++++++++++++++++------------------ 1 file changed, 47 insertions(+), 33 deletions(-)
diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index f5f81274658e..9df8a209a6e2 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -133,41 +133,23 @@ static void pids_uncharge(struct pids_cgroup *pids, int num) pids_cancel(p, num); }
-/** - * pids_charge - hierarchically charge the pid count - * @pids: the pid cgroup state - * @num: the number of pids to charge - * - * This function does *not* follow the pid limit set. It cannot fail and the new - * pid count may exceed the limit. This is only used for reverting failed - * attaches, where there is no other way out than violating the limit. - */ -static void pids_charge(struct pids_cgroup *pids, int num) -{ - struct pids_cgroup *p; - - for (p = pids; parent_pids(p); p = parent_pids(p)) { - int64_t new = atomic64_add_return(num, &p->counter); - - pids_update_watermark(p, new); - } -} - /** * pids_try_charge - hierarchically try to charge the pid count * @pids: the pid cgroup state * @num: the number of pids to charge + * @root: charge only under this root (NULL is global root) * @fail: storage of pid cgroup causing the fail * * This function follows the set limit. It will fail if the charge would cause - * the new value to exceed the hierarchical limit. Returns 0 if the charge - * succeeded, otherwise -EAGAIN. + * the new value to exceed the hierarchical limit and fail is set. Returns 0 if + * no limit was hit, otherwise -EAGAIN. */ -static int pids_try_charge(struct pids_cgroup *pids, int num, struct pids_cgroup **fail) +static int pids_try_charge(struct pids_cgroup *pids, int num, struct pids_cgroup *root, struct pids_cgroup **fail) { struct pids_cgroup *p, *q; + int ret = 0;
- for (p = pids; parent_pids(p); p = parent_pids(p)) { + for (p = pids; parent_pids(p) && p != root; p = parent_pids(p)) { int64_t new = atomic64_add_return(num, &p->counter); int64_t limit = atomic64_read(&p->limit);
@@ -177,8 +159,11 @@ static int pids_try_charge(struct pids_cgroup *pids, int num, struct pids_cgroup * fail. */ if (new > limit) { - *fail = p; - goto revert; + ret = -EAGAIN; + if (fail) { + *fail = p; + goto revert; + } } /* * Not technically accurate if we go over limit somewhere up @@ -187,14 +172,45 @@ static int pids_try_charge(struct pids_cgroup *pids, int num, struct pids_cgroup pids_update_watermark(p, new); }
- return 0; + return ret;
revert: for (q = pids; q != p; q = parent_pids(q)) pids_cancel(q, num); pids_cancel(p, num);
- return -EAGAIN; + return ret; +} + +/** + * pids_tranfer_charge - charge/uncharge in subtree betwee src and dst + * @src: pid cgroup state to uncharge + * @dst: pid cgroup state to charge + * @num: the number of pids to transfer + * + * The function updates charged pids in subtree whose root is the closest + * common ancestor of @src and @dst. This root and its ancestors are not + * modified (their limits are not enacted). + * + * Returns 0 if no limit was hit, -EAGAIN if a limit on path [@dst, @comm) was + * hit (charges are transferred despite the limit). + */ +static int pids_tranfer_charge(struct pids_cgroup *src, struct pids_cgroup *dst, int num) +{ + struct pids_cgroup *p, *comm = src; + int ret; + + /* for stable cgroup tree */ + lockdep_assert_held(&cgroup_mutex); + + while (!cgroup_is_descendant(dst->css.cgroup, comm->css.cgroup)) + comm = parent_pids(comm); + + ret = pids_try_charge(dst, num, comm, NULL); + + for (p = src; p != comm; p = parent_pids(p)) + pids_cancel(p, num); + return ret; }
static int pids_can_attach(struct cgroup_taskset *tset) @@ -215,8 +231,7 @@ static int pids_can_attach(struct cgroup_taskset *tset) old_css = task_css(task, pids_cgrp_id); old_pids = css_pids(old_css);
- pids_charge(pids, 1); - pids_uncharge(old_pids, 1); + (void) pids_tranfer_charge(old_pids, pids, 1); }
return 0; @@ -235,8 +250,7 @@ static void pids_cancel_attach(struct cgroup_taskset *tset) old_css = task_css(task, pids_cgrp_id); old_pids = css_pids(old_css);
- pids_charge(old_pids, 1); - pids_uncharge(pids, 1); + (void) pids_tranfer_charge(pids, old_pids, 1); } }
@@ -287,7 +301,7 @@ static int pids_can_fork(struct task_struct *task, struct css_set *cset) else css = task_css_check(current, pids_cgrp_id, true); pids = css_pids(css); - err = pids_try_charge(pids, 1, &pids_over_limit); + err = pids_try_charge(pids, 1, NULL, &pids_over_limit); if (err) pids_event(pids, pids_over_limit);
While pids controller is designed with only forks in mind, it leads to situations where limit is apparently ineffective. A manager daemon is in /src and it spawns tasks into /dst. The administrator sets up a limit dst/pids.max while src/pids.max is unlimited. The manager daemon can spawn more than dst/pids.max tasks because they get into their target cgroup via migration (or CLONE_INTO_CGROUP).
For this (migration) to work both src and dst must be in the same resource domain so the manager daemon does not honor the limit which is under its control anyway and no excessive resource consumption happens.
dst/pids.current > dst/pids.max may come as a surprise when the spawning mechanism is opaque to the administrator of dst/pids.max.
Change the behavior of pids controller to take into account limits of target cgroup upon migration (but only below common ancestor src and dst, pids.current of common ancestor and above is not affected by migration, so deliberatly ignore pre-existing pids.current > pids.max).
This change of behavior is hidden behind cgroup2 mount option and the default is unchanged, pids.max won't affect migrations.
Signed-off-by: Michal Koutný mkoutny@suse.com --- Documentation/admin-guide/cgroup-v2.rst | 8 +++++++- include/linux/cgroup-defs.h | 7 ++++++- kernel/cgroup/cgroup.c | 16 +++++++++++++++- kernel/cgroup/pids.c | 8 ++++++-- 4 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 5d4c505cae06..d7e721aed584 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -239,6 +239,11 @@ cgroup v2 currently supports the following mount options. will not be tracked by the memory controller (even if cgroup v2 is remounted later on).
+ pids_miglimit + Apply pids.max limit also when migrating tasks between cgroups. Only + new destination limit are taken into account, i.e. if subtree has + pids.current > pids.max, migration within that subtree is allowed. +
Organizing Processes and Threads -------------------------------- @@ -2204,7 +2209,8 @@ Organisational operations are not blocked by cgroup policies, so it is possible to have pids.current > pids.max. This can be done by either setting the limit to be smaller than pids.current, or attaching enough processes to the cgroup such that pids.current is larger than -pids.max. However, it is not possible to violate a cgroup PID policy +pids.max (unless pids_miglimit mount options is given). +However, it is not possible to violate a cgroup PID policy through fork() or clone(). These will return -EAGAIN if the creation of a new process would cause a cgroup policy to be violated.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index ea48c861cd36..a99db24b5496 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -119,7 +119,12 @@ enum { /* * Enable hugetlb accounting for the memory controller. */ - CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19), + CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19), + + /* + * Enforce pids limit upon task migration + */ + CGRP_ROOT_PIDS_MIGRATION_LIMIT = (1 << 20), };
/* cftype->flags */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index a66c088c851c..9aa6428c84c1 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1922,6 +1922,7 @@ enum cgroup2_param { Opt_memory_localevents, Opt_memory_recursiveprot, Opt_memory_hugetlb_accounting, + Opt_pids_miglimit, nr__cgroup2_params };
@@ -1931,6 +1932,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = { fsparam_flag("memory_localevents", Opt_memory_localevents), fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot), fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting), + fsparam_flag("pids_miglimit", Opt_pids_miglimit), {} };
@@ -1960,6 +1962,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param case Opt_memory_hugetlb_accounting: ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; return 0; + case Opt_pids_miglimit: + ctx->flags |= CGRP_ROOT_PIDS_MIGRATION_LIMIT; + return 0; } return -EINVAL; } @@ -1989,6 +1994,12 @@ static void apply_cgroup_root_flags(unsigned int root_flags) cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; else cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING; + + if (root_flags & CGRP_ROOT_PIDS_MIGRATION_LIMIT) + cgrp_dfl_root.flags |= CGRP_ROOT_PIDS_MIGRATION_LIMIT; + else + cgrp_dfl_root.flags &= ~CGRP_ROOT_PIDS_MIGRATION_LIMIT; + } }
@@ -2004,6 +2015,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root seq_puts(seq, ",memory_recursiveprot"); if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING) seq_puts(seq, ",memory_hugetlb_accounting"); + if (cgrp_dfl_root.flags & CGRP_ROOT_PIDS_MIGRATION_LIMIT) + seq_puts(seq, ",pids_miglimit"); return 0; }
@@ -7061,7 +7074,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr, "favordynmods\n" "memory_localevents\n" "memory_recursiveprot\n" - "memory_hugetlb_accounting\n"); + "memory_hugetlb_accounting\n" + "pids_miglimit\n"); } static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 9df8a209a6e2..4683629b8168 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -217,6 +217,7 @@ static int pids_can_attach(struct cgroup_taskset *tset) { struct task_struct *task; struct cgroup_subsys_state *dst_css; + int err, ret = 0;
cgroup_taskset_for_each(task, dst_css, tset) { struct pids_cgroup *pids = css_pids(dst_css); @@ -231,10 +232,13 @@ static int pids_can_attach(struct cgroup_taskset *tset) old_css = task_css(task, pids_cgrp_id); old_pids = css_pids(old_css);
- (void) pids_tranfer_charge(old_pids, pids, 1); + err = pids_tranfer_charge(old_pids, pids, 1); + + if (!ret && (cgrp_dfl_root.flags & CGRP_ROOT_PIDS_MIGRATION_LIMIT)) + ret = err; }
- return 0; + return ret; }
static void pids_cancel_attach(struct cgroup_taskset *tset)
This adds a couple of tests to check enforcing of limits in pids controller upon migration. When the new option does not exist, the test is skipped.
Signed-off-by: Michal Koutný mkoutny@suse.com --- tools/testing/selftests/cgroup/test_pids.c | 117 ++++++++++++++++++++- 1 file changed, 116 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_pids.c b/tools/testing/selftests/cgroup/test_pids.c index c1c3a3965624..a3ad5a495f59 100644 --- a/tools/testing/selftests/cgroup/test_pids.c +++ b/tools/testing/selftests/cgroup/test_pids.c @@ -12,6 +12,8 @@ #include "../kselftest.h" #include "cgroup_util.h"
+static bool has_miglimit; + static int run_success(const char *cgroup, void *arg) { return 0; @@ -69,6 +71,112 @@ static int test_pids_max(const char *root) return ret; }
+/* + * This test checks that pids.max prevents migrating tasks over limit into the + * cgroup. + */ +static int test_pids_max_migration(const char *root) +{ + int ret = KSFT_FAIL; + char *cg_pids; + int pid; + + if (!has_miglimit) + return KSFT_SKIP; + + cg_pids = cg_name(root, "pids_test"); + if (!cg_pids) + goto cleanup; + + if (cg_create(cg_pids)) + goto cleanup; + + if (cg_write(cg_pids, "pids.max", "1")) + goto cleanup; + + pid = cg_run_nowait(cg_pids, run_pause, NULL); + if (pid < 0) + goto cleanup; + + if (cg_enter_current(cg_pids) >= 0) + goto cleanup; + + if (kill(pid, SIGINT)) + goto cleanup; + + ret = KSFT_PASS; + +cleanup: + cg_enter_current(root); + cg_destroy(cg_pids); + free(cg_pids); + + return ret; +} + +/* + * This test checks that pids.max does not prevent migrating existing tasks + * inside subtree. + */ +static int test_pids_max_overcommit(const char *root) +{ + int ret = KSFT_FAIL; + char *cg_parent = NULL, *cg_src = NULL, *cg_dst = NULL; + int pid; + + if (!has_miglimit) + return KSFT_SKIP; + + cg_parent = cg_name(root, "pids_test"); + if (!cg_parent) + goto cleanup; + cg_src = cg_name(cg_parent, "src"); + if (!cg_src) + goto cleanup; + cg_dst = cg_name(cg_parent, "dst"); + if (!cg_dst) + goto cleanup; + + if (cg_create(cg_parent)) + goto cleanup; + if (cg_write(cg_parent, "cgroup.subtree_control", "+pids")) + goto cleanup; + if (cg_create(cg_src)) + goto cleanup; + if (cg_create(cg_dst)) + goto cleanup; + + if (cg_enter_current(cg_src) < 0) + goto cleanup; + + pid = cg_run_nowait(cg_src, run_pause, NULL); + if (pid < 0) + goto cleanup; + + if (cg_write(cg_parent, "pids.max", "1")) + goto cleanup; + + if (cg_enter(cg_dst, pid) < 0) + goto cleanup; + + if (kill(pid, SIGINT)) + goto cleanup; + + ret = KSFT_PASS; + +cleanup: + cg_enter_current(root); + cg_destroy(cg_dst); + cg_destroy(cg_src); + cg_destroy(cg_parent); + free(cg_dst); + free(cg_src); + free(cg_parent); + + return ret; +} + + /* * This test checks that pids.max prevents forking new children above the * specified limit in the cgroup. @@ -145,6 +253,8 @@ struct pids_test { const char *name; } tests[] = { T(test_pids_max), + T(test_pids_max_migration), + T(test_pids_max_overcommit), T(test_pids_events), }; #undef T @@ -152,7 +262,7 @@ struct pids_test { int main(int argc, char **argv) { char root[PATH_MAX]; - int i, ret = EXIT_SUCCESS; + int i, proc_status, ret = EXIT_SUCCESS;
if (cg_find_unified_root(root, sizeof(root))) ksft_exit_skip("cgroup v2 isn't mounted\n"); @@ -168,6 +278,11 @@ int main(int argc, char **argv) if (cg_write(root, "cgroup.subtree_control", "+pids")) ksft_exit_skip("Failed to set pids controller\n");
+ proc_status = proc_mount_contains("pids_miglimit"); + if (proc_status < 0) + ksft_exit_skip("Failed to query cgroup mount option\n"); + has_miglimit = proc_status; + for (i = 0; i < ARRAY_SIZE(tests); i++) { switch (tests[i].fn(root)) { case KSFT_PASS:
linux-kselftest-mirror@lists.linaro.org