From: Joshua Hahn <joshua.hahn6@gmail.com>
Niced CPU usage is a metric reported in the host-level /proc/stat, but it is not reported in the cgroup-level statistics in cpu.stat. When a host runs multiple tasks across different workloads, it becomes difficult to gauge how much of a workload's CPU time is spent on niced processes from /proc/stat alone, since host-level metrics do not provide cgroup-level granularity.
Exposing this metric will allow load balancers to accurately probe the niced CPU time of each workload and make more informed decisions when directing higher-priority tasks.
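With this series, a cgroup's cpu.stat gains a nice_usec field. As an illustrative sketch (the numbers below are made up), reading a cgroup's cpu.stat would show something like:

  usage_usec 530000
  user_usec 528000
  system_usec 2000
  nice_usec 314000

Note that with this patch, nice_usec counts a subset of user_usec, since niced time is accounted to both fields.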
Joshua Hahn (2):
  Tracking cgroup-level niced CPU time
  Selftests for niced CPU statistics
 include/linux/cgroup-defs.h               |  1 +
 kernel/cgroup/rstat.c                     | 16 ++++-
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 3 files changed, 86 insertions(+), 3 deletions(-)
From: Joshua Hahn <joshua.hahn6@gmail.com>
Cgroup-level CPU statistics currently include time spent on user/system processes, but do not include niced CPU time (despite it already being tracked). This patch exposes niced CPU time to userspace, allowing users to get a better understanding of their hardware limits and facilitating better load-balancing.
A new field 'ntime' is added to struct cgroup_base_stat (as opposed to struct task_cputime) to minimize the footprint.
---
 include/linux/cgroup-defs.h |  1 +
 kernel/cgroup/rstat.c       | 16 +++++++++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ae04035b6cbe..a2fcb3db6c52 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -315,6 +315,7 @@ struct cgroup_base_stat {
 #ifdef CONFIG_SCHED_CORE
 	u64 forceidle_sum;
 #endif
+	u64 ntime;
 };
 
 /*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a06b45272411..a77ba9a83bab 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -444,6 +444,7 @@ static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
 #ifdef CONFIG_SCHED_CORE
 	dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
 #endif
+	dst_bstat->ntime += src_bstat->ntime;
 }
 
 static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
@@ -455,6 +456,7 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 #ifdef CONFIG_SCHED_CORE
 	dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
 #endif
+	dst_bstat->ntime -= src_bstat->ntime;
 }
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
@@ -535,7 +537,10 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
 
 	switch (index) {
 	case CPUTIME_USER:
+		rstatc->bstat.cputime.utime += delta_exec;
+		break;
 	case CPUTIME_NICE:
+		rstatc->bstat.ntime += delta_exec;
 		rstatc->bstat.cputime.utime += delta_exec;
 		break;
 	case CPUTIME_SYSTEM:
@@ -591,6 +596,7 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
 #ifdef CONFIG_SCHED_CORE
 		bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
 #endif
+		bstat->ntime += cpustat[CPUTIME_NICE];
 	}
 }
 
@@ -608,13 +614,14 @@ static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat
 void cgroup_base_stat_cputime_show(struct seq_file *seq)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
-	u64 usage, utime, stime;
+	u64 usage, utime, stime, ntime;
 
 	if (cgroup_parent(cgrp)) {
 		cgroup_rstat_flush_hold(cgrp);
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &utime, &stime);
+		ntime = cgrp->bstat.ntime;
 		cgroup_rstat_flush_release(cgrp);
 	} else {
 		/* cgrp->bstat of root is not actually used, reuse it */
@@ -622,16 +629,19 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		utime = cgrp->bstat.cputime.utime;
 		stime = cgrp->bstat.cputime.stime;
+		ntime = cgrp->bstat.ntime;
 	}
 
 	do_div(usage, NSEC_PER_USEC);
 	do_div(utime, NSEC_PER_USEC);
 	do_div(stime, NSEC_PER_USEC);
+	do_div(ntime, NSEC_PER_USEC);
 
 	seq_printf(seq, "usage_usec %llu\n"
 		   "user_usec %llu\n"
-		   "system_usec %llu\n",
-		   usage, utime, stime);
+		   "system_usec %llu\n"
+		   "nice_usec %llu\n",
+		   usage, utime, stime, ntime);
 
 	cgroup_force_idle_show(seq, &cgrp->bstat);
 }
Hello,
On Fri, Aug 23, 2024 at 01:05:17PM -0700, Joshua Hahn <joshua.hahn6@gmail.com> wrote:
From: Joshua Hahn <joshua.hahn6@gmail.com>
Cgroup-level CPU statistics currently include time spent on user/system processes, but do not include niced CPU time (despite it already being tracked). This patch exposes niced CPU time to userspace, allowing users to get a better understanding of their hardware limits and facilitating better load-balancing.
You aren't talking about the in-kernel scheduler's load balancer, right? If so, can you please update the description? This is a bit too confusing for a kernel commit message.
Thanks.
Hello, thank you for reviewing the patch.
Cgroup-level CPU statistics currently include time spent on user/system processes, but do not include niced CPU time (despite it already being tracked). This patch exposes niced CPU time to userspace, allowing users to get a better understanding of their hardware limits and facilitating better load-balancing.
You aren't talking about the in-kernel scheduler's load balancer, right? If so, can you please update the description? This is a bit too confusing for a kernel commit message.
Thank you for pointing this out -- I'll edit the commit message to the following in a v2:
Cgroup-level CPU statistics currently include time spent on user/system processes, but do not include niced CPU time (despite it already being tracked). This patch exposes niced CPU time to userspace, allowing users to get a better understanding of their hardware limits and facilitating more informed workload distribution.
Thanks, Joshua
From: Joshua Hahn <joshua.hahn6@gmail.com>
Creates a cgroup with a single niced CPU hog process running. fork() is called to generate the niced process because un-nicing is not possible (see man nice(3)). If fork() were not used to generate the CPU hog, the rest of the cgroup selftest suite would run as a niced process.
---
 tools/testing/selftests/cgroup/test_cpu.c | 72 +++++++++++++++++++++++
 1 file changed, 72 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_cpu.c b/tools/testing/selftests/cgroup/test_cpu.c
index dad2ed82f3ef..cd5550391f49 100644
--- a/tools/testing/selftests/cgroup/test_cpu.c
+++ b/tools/testing/selftests/cgroup/test_cpu.c
@@ -8,6 +8,7 @@
 #include <pthread.h>
 #include <stdio.h>
 #include <time.h>
+#include <unistd.h>
 
 #include "../kselftest.h"
 #include "cgroup_util.h"
@@ -229,6 +230,76 @@ static int test_cpucg_stats(const char *root)
 	return ret;
 }
 
+/*
+ * Creates a nice process that consumes CPU and checks that the elapsed
+ * usertime in the cgroup is close to the expected time.
+ */
+static int test_cpucg_nice(const char *root)
+{
+	int ret = KSFT_FAIL;
+	int status;
+	long user_usec, nice_usec;
+	long usage_seconds = 2;
+	long expected_nice_usec = usage_seconds * USEC_PER_SEC;
+	char *cpucg;
+	pid_t pid;
+
+	cpucg = cg_name(root, "cpucg_test");
+	if (!cpucg)
+		goto cleanup;
+
+	if (cg_create(cpucg))
+		goto cleanup;
+
+	user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+	nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+	if (user_usec != 0 || nice_usec != 0)
+		goto cleanup;
+
+	/*
+	 * We fork here to create a new process that can be niced without
+	 * polluting the nice value of other selftests
+	 */
+	pid = fork();
+	if (pid < 0) {
+		goto cleanup;
+	} else if (pid == 0) {
+		struct cpu_hog_func_param param = {
+			.nprocs = 1,
+			.ts = {
+				.tv_sec = usage_seconds,
+				.tv_nsec = 0,
+			},
+			.clock_type = CPU_HOG_CLOCK_PROCESS,
+		};
+
+		/* Try to keep niced CPU usage as constrained to hog_cpu as possible */
+		nice(1);
+		cg_run(cpucg, hog_cpus_timed, (void *)&param);
+		exit(0);
+	} else {
+		waitpid(pid, &status, 0);
+		if (!WIFEXITED(status))
+			goto cleanup;
+
+		user_usec = cg_read_key_long(cpucg, "cpu.stat", "user_usec");
+		nice_usec = cg_read_key_long(cpucg, "cpu.stat", "nice_usec");
+		if (nice_usec > user_usec || user_usec <= 0)
+			goto cleanup;
+
+		if (!values_close(nice_usec, expected_nice_usec, 1))
+			goto cleanup;
+
+		ret = KSFT_PASS;
+	}
+
+cleanup:
+	cg_destroy(cpucg);
+	free(cpucg);
+
+	return ret;
+}
+
 static int run_cpucg_weight_test(
 		const char *root,
@@ -686,6 +757,7 @@ struct cpucg_test {
 } tests[] = {
 	T(test_cpucg_subtree_control),
 	T(test_cpucg_stats),
+	T(test_cpucg_nice),
 	T(test_cpucg_weight_overprovisioned),
 	T(test_cpucg_weight_underprovisioned),
 	T(test_cpucg_nested_weight_overprovisioned),
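The new test runs with the rest of the cgroup selftests; assuming a standard kselftest setup, something like the following should pick it up:

  make -C tools/testing/selftests TARGETS=cgroup run_tests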
Hello.
On Fri, Aug 23, 2024 at 01:05:16PM GMT, Joshua Hahn <joshua.hahn6@gmail.com> wrote:
Niced CPU usage is a metric reported in the host-level /proc/stat, but it is not reported in the cgroup-level statistics in cpu.stat. When a host runs multiple tasks across different workloads, it becomes difficult to gauge how much of a workload's CPU time is spent on niced processes from /proc/stat alone, since host-level metrics do not provide cgroup-level granularity.
The difference between the two metrics is in cputime.c:

	index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
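For context, that line sits in account_user_time() in kernel/sched/cputime.c, which picks the cpustat index and passes it down to the cgroup side. Roughly (abridged sketch of that function):

	void account_user_time(struct task_struct *p, u64 cputime)
	{
		int index;

		/* Add user time to the process. */
		p->utime += cputime;
		account_group_user_time(p, cputime);

		/* Tasks with a positive nice value are accounted as CPUTIME_NICE */
		index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;

		/* Add user time to cpustat; this is the path that reaches
		 * __cgroup_account_cputime_field() in the patch above. */
		task_group_account_field(p, index, cputime);
		...
	}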
Exposing this metric will allow load balancers to accurately probe the niced CPU time of each workload and make more informed decisions when directing higher-priority tasks.
How would this work? (E.g. if too little nice time -> reduce priority of high prio tasks?)
Thanks, Michal
Hello, thank you for reviewing the patch.
On Mon, Aug 26, 2024 at 10:43 AM Michal Koutný <mkoutny@suse.com> wrote:
The difference between the two metrics is in cputime.c:

	index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
Exposing this metric will allow load balancers to accurately probe the niced CPU time of each workload and make more informed decisions when directing higher-priority tasks.
How would this work? (E.g. if too little nice time -> reduce priority of high prio tasks?)
We can find what fraction of a workload's time is spent running niced processes by dividing the two metrics (nice / user). When a high-priority task comes into the load balancer and a decision must be made about where to delegate it, the balancer can use each workload's niced fraction as one factor in the decision.
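As a rough sketch of what such a userspace balancer could do (the cgroup path below is a made-up example, not something this series defines):

#include <stdio.h>
#include <string.h>

/* Sketch: compute the niced fraction of a workload's user time from its
 * cgroup's cpu.stat. The path is a hypothetical example. */
int main(void)
{
	const char *path = "/sys/fs/cgroup/workload-a/cpu.stat";
	char key[64];
	long long val, user_usec = 0, nice_usec = 0;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* cpu.stat is a flat "key value" file, one pair per line */
	while (fscanf(f, "%63s %lld", key, &val) == 2) {
		if (!strcmp(key, "user_usec"))
			user_usec = val;
		else if (!strcmp(key, "nice_usec"))
			nice_usec = val;
	}
	fclose(f);

	if (user_usec > 0)
		printf("niced fraction: %.2f\n",
		       (double)nice_usec / (double)user_usec);
	return 0;
}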
The reverse is also true: host-level information in /proc/stat may indicate that a high percentage of CPU time is being used by niced processes, giving the illusion that all tasks within the host are running niced processes, when in reality it is just one task using a lot of niced CPU time while the other tasks run non-niced processes. By including cgroup-level nice statistics, we can get a clearer picture and avoid overloading a host with too many high-priority tasks.
Like you suggested, this information can also help in re-prioritizing processes, which may help high-priority tasks execute more quickly.
Thanks, Joshua