Hi Greg,
This series fixes issues we've seen with softirq time accounting in 4.9: - when ksoftirqd is running at 100% on a CPU, none of the values reported by /proc/stat for that CPU will change, sometimes for dozens of seconds, - large deviations in the total number of ticks accumulated over a fixed time for a CPU, probably because of the first issue hitting for shorter periods.
We found out that something pretty similar had been reported 9 months ago, see the reference link below. In that discussion, Rabin Vincent had made a 4.9 specific patch which fixes our first issue, but we were still seeing some deviation from the total number of ticks (up to 1.7% from expected, where we had only 0.2% on older kernels), and you had also asked for a direct backport from the mainline series, if possible.
As mentioned in that thread, a lot of changes (probably 50+) went into 4.11 to remove cputime, but we could get something working with only the 4 attached patches to fix these two issues. Three of these patches apply without change, and the second one in the series ("sched/cputime: Convert kcpustat to nsecs") needed a minor change as a cast had been added in 527b0a76f41d ("sched/cpuacct: Avoid %lld seq_printf warning") to fix a build warning on s390. I guess we could also include that patch in this series, let me know if this is the preferred way to handle this.
We ran our tests on 3.18, 4.4 and 4.9 and confirmed that only 4.9 would need this series, and that this series indeed restores the behavior we were seeing on those older kernels.
Thanks!
Reference: http://lkml.kernel.org/r/%3C1513159876-5125-1-git-send-email-rabin.vincent@a...
Frederic Weisbecker (4): time: Introduce jiffies64_to_nsecs() sched/cputime: Convert kcpustat to nsecs sched/cputime: Increment kcpustat directly on irqtime account sched/cputime: Fix ksoftirqd cputime accounting regression
arch/s390/appldata/appldata_os.c | 16 +++---- drivers/cpufreq/cpufreq.c | 6 +-- drivers/cpufreq/cpufreq_governor.c | 2 +- drivers/cpufreq/cpufreq_stats.c | 1 - drivers/macintosh/rack-meter.c | 2 +- fs/proc/stat.c | 68 +++++++++++++-------------- fs/proc/uptime.c | 7 +-- include/linux/jiffies.h | 2 + kernel/sched/cpuacct.c | 2 +- kernel/sched/cputime.c | 75 +++++++++++++----------------- kernel/sched/sched.h | 12 +++-- kernel/time/time.c | 10 ++++ kernel/time/timeconst.bc | 6 +++ 13 files changed, 109 insertions(+), 100 deletions(-)
From: Frederic Weisbecker fweisbec@gmail.com
commit 07e5f5e353aaa61696c8353d87050994a0c4648a upstream.
This will be needed for the cputime_t to nsec conversion.
Signed-off-by: Frederic Weisbecker fweisbec@gmail.com Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Paul Mackerras paulus@samba.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Heiko Carstens heiko.carstens@de.ibm.com Cc: Martin Schwidefsky schwidefsky@de.ibm.com Cc: Tony Luck tony.luck@intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: Stanislaw Gruszka sgruszka@redhat.com Cc: Wanpeng Li wanpeng.li@hotmail.com Link: http://lkml.kernel.org/r/1485832191-26889-2-git-send-email-fweisbec@gmail.co... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Ivan Delalande colona@arista.com --- include/linux/jiffies.h | 2 ++ kernel/time/time.c | 10 ++++++++++ kernel/time/timeconst.bc | 6 ++++++ 3 files changed, 18 insertions(+)
diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h index c2a0f0072274..734377ad42e9 100644 --- a/include/linux/jiffies.h +++ b/include/linux/jiffies.h @@ -292,6 +292,8 @@ static inline u64 jiffies_to_nsecs(const unsigned long j) return (u64)jiffies_to_usecs(j) * NSEC_PER_USEC; }
+extern u64 jiffies64_to_nsecs(u64 j); + extern unsigned long __msecs_to_jiffies(const unsigned int m); #if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ) /* diff --git a/kernel/time/time.c b/kernel/time/time.c index 39468651a064..a5b6d98ea7b1 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -704,6 +704,16 @@ u64 nsec_to_clock_t(u64 x) #endif }
+u64 jiffies64_to_nsecs(u64 j) +{ +#if !(NSEC_PER_SEC % HZ) + return (NSEC_PER_SEC / HZ) * j; +# else + return div_u64(j * HZ_TO_NSEC_NUM, HZ_TO_NSEC_DEN); +#endif +} +EXPORT_SYMBOL(jiffies64_to_nsecs); + /** * nsecs_to_jiffies64 - Convert nsecs in u64 to jiffies64 * diff --git a/kernel/time/timeconst.bc b/kernel/time/timeconst.bc index c48688904f9f..f83bbb81600b 100644 --- a/kernel/time/timeconst.bc +++ b/kernel/time/timeconst.bc @@ -98,6 +98,12 @@ define timeconst(hz) { print "#define HZ_TO_USEC_DEN\t\t", hz/cd, "\n" print "#define USEC_TO_HZ_NUM\t\t", hz/cd, "\n" print "#define USEC_TO_HZ_DEN\t\t", 1000000/cd, "\n" + + cd=gcd(hz,1000000000) + print "#define HZ_TO_NSEC_NUM\t\t", 1000000000/cd, "\n" + print "#define HZ_TO_NSEC_DEN\t\t", hz/cd, "\n" + print "#define NSEC_TO_HZ_NUM\t\t", hz/cd, "\n" + print "#define NSEC_TO_HZ_DEN\t\t", 1000000000/cd, "\n" print "\n"
print "#endif /* KERNEL_TIMECONST_H */\n"
From: Frederic Weisbecker fweisbec@gmail.com
commit 7fb1327ee9b92fca27662f9b9d60c7c3376d6c69 upstream.
Kernel CPU stats are stored in cputime_t which is an architecture defined type, and hence a bit opaque and requiring accessors and mutators for any operation.
Converting them to nsecs simplifies the code and is one step toward the removal of cputime_t in the core code.
Signed-off-by: Frederic Weisbecker fweisbec@gmail.com Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Paul Mackerras paulus@samba.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Heiko Carstens heiko.carstens@de.ibm.com Cc: Martin Schwidefsky schwidefsky@de.ibm.com Cc: Tony Luck tony.luck@intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: Stanislaw Gruszka sgruszka@redhat.com Cc: Wanpeng Li wanpeng.li@hotmail.com Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.co... Signed-off-by: Ingo Molnar mingo@kernel.org [colona: minor conflict as 527b0a76f41d ("sched/cpuacct: Avoid %lld seq_printf warning") is missing from v4.9] Signed-off-by: Ivan Delalande colona@arista.com --- arch/s390/appldata/appldata_os.c | 16 +++---- drivers/cpufreq/cpufreq.c | 6 +-- drivers/cpufreq/cpufreq_governor.c | 2 +- drivers/cpufreq/cpufreq_stats.c | 1 - drivers/macintosh/rack-meter.c | 2 +- fs/proc/stat.c | 68 +++++++++++++++--------------- fs/proc/uptime.c | 7 +-- kernel/sched/cpuacct.c | 2 +- kernel/sched/cputime.c | 22 +++++----- 9 files changed, 61 insertions(+), 65 deletions(-)
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c index 69b23b25ac34..08b9e942a262 100644 --- a/arch/s390/appldata/appldata_os.c +++ b/arch/s390/appldata/appldata_os.c @@ -113,21 +113,21 @@ static void appldata_get_os_data(void *data) j = 0; for_each_online_cpu(i) { os_data->os_cpu[j].per_cpu_user = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_USER]); + nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_USER]); os_data->os_cpu[j].per_cpu_nice = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_NICE]); + nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_NICE]); os_data->os_cpu[j].per_cpu_system = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]); + nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]); os_data->os_cpu[j].per_cpu_idle = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IDLE]); + nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IDLE]); os_data->os_cpu[j].per_cpu_irq = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IRQ]); + nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IRQ]); os_data->os_cpu[j].per_cpu_softirq = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]); + nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]); os_data->os_cpu[j].per_cpu_iowait = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IOWAIT]); + nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IOWAIT]); os_data->os_cpu[j].per_cpu_steal = - cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_STEAL]); + nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_STEAL]); os_data->os_cpu[j].cpu_id = i; j++; } diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index af5eff6835a8..d6d91e8afa9e 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -132,7 +132,7 @@ static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall) u64 cur_wall_time; u64 busy_time;
- cur_wall_time = jiffies64_to_cputime64(get_jiffies_64()); + cur_wall_time = jiffies64_to_nsecs(get_jiffies_64());
busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER]; busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM]; @@ -143,9 +143,9 @@ static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
idle_time = cur_wall_time - busy_time; if (wall) - *wall = cputime_to_usecs(cur_wall_time); + *wall = div_u64(cur_wall_time, NSEC_PER_USEC);
- return cputime_to_usecs(idle_time); + return div_u64(idle_time, NSEC_PER_USEC); }
u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy) diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c index 642dd0f183a8..38d1a8216084 100644 --- a/drivers/cpufreq/cpufreq_governor.c +++ b/drivers/cpufreq/cpufreq_governor.c @@ -152,7 +152,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy) if (ignore_nice) { u64 cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
- idle_time += cputime_to_usecs(cur_nice - j_cdbs->prev_cpu_nice); + idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC); j_cdbs->prev_cpu_nice = cur_nice; }
diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c index 06d3abdffd3a..b084708fd113 100644 --- a/drivers/cpufreq/cpufreq_stats.c +++ b/drivers/cpufreq/cpufreq_stats.c @@ -13,7 +13,6 @@ #include <linux/cpufreq.h> #include <linux/module.h> #include <linux/slab.h> -#include <linux/cputime.h>
static DEFINE_SPINLOCK(cpufreq_stats_lock);
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c index 25852e399ab2..13b16f34256c 100644 --- a/drivers/macintosh/rack-meter.c +++ b/drivers/macintosh/rack-meter.c @@ -91,7 +91,7 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu) if (rackmeter_ignore_nice) retval += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
- return retval; + return nsecs_to_cputime64(retval); }
static void rackmeter_setup_i2s(struct rackmeter *rm) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index d700c42b3572..44475a44cbf1 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -21,23 +21,23 @@
#ifdef arch_idle_time
-static cputime64_t get_idle_time(int cpu) +static u64 get_idle_time(int cpu) { - cputime64_t idle; + u64 idle;
idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE]; if (cpu_online(cpu) && !nr_iowait_cpu(cpu)) - idle += arch_idle_time(cpu); + idle += cputime_to_nsecs(arch_idle_time(cpu)); return idle; }
-static cputime64_t get_iowait_time(int cpu) +static u64 get_iowait_time(int cpu) { - cputime64_t iowait; + u64 iowait;
iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT]; if (cpu_online(cpu) && nr_iowait_cpu(cpu)) - iowait += arch_idle_time(cpu); + iowait += cputime_to_nsecs(arch_idle_time(cpu)); return iowait; }
@@ -45,32 +45,32 @@ static cputime64_t get_iowait_time(int cpu)
static u64 get_idle_time(int cpu) { - u64 idle, idle_time = -1ULL; + u64 idle, idle_usecs = -1ULL;
if (cpu_online(cpu)) - idle_time = get_cpu_idle_time_us(cpu, NULL); + idle_usecs = get_cpu_idle_time_us(cpu, NULL);
- if (idle_time == -1ULL) + if (idle_usecs == -1ULL) /* !NO_HZ or cpu offline so we can rely on cpustat.idle */ idle = kcpustat_cpu(cpu).cpustat[CPUTIME_IDLE]; else - idle = usecs_to_cputime64(idle_time); + idle = idle_usecs * NSEC_PER_USEC;
return idle; }
static u64 get_iowait_time(int cpu) { - u64 iowait, iowait_time = -1ULL; + u64 iowait, iowait_usecs = -1ULL;
if (cpu_online(cpu)) - iowait_time = get_cpu_iowait_time_us(cpu, NULL); + iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);
- if (iowait_time == -1ULL) + if (iowait_usecs == -1ULL) /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */ iowait = kcpustat_cpu(cpu).cpustat[CPUTIME_IOWAIT]; else - iowait = usecs_to_cputime64(iowait_time); + iowait = iowait_usecs * NSEC_PER_USEC;
return iowait; } @@ -115,16 +115,16 @@ static int show_stat(struct seq_file *p, void *v) } sum += arch_irq_stat();
- seq_put_decimal_ull(p, "cpu ", cputime64_to_clock_t(user)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(nice)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(system)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(idle)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(iowait)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(irq)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(softirq)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(steal)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(guest)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, "cpu ", nsec_to_clock_t(user)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(system)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(idle)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(iowait)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(irq)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(softirq)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(steal)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice)); seq_putc(p, '\n');
for_each_online_cpu(i) { @@ -140,16 +140,16 @@ static int show_stat(struct seq_file *p, void *v) guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; seq_printf(p, "cpu%d", i); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(user)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(nice)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(system)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(idle)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(iowait)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(irq)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(softirq)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(steal)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(guest)); - seq_put_decimal_ull(p, " ", cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(user)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(system)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(idle)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(iowait)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(irq)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(softirq)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(steal)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest)); + seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice)); seq_putc(p, '\n'); } seq_put_decimal_ull(p, "intr ", (unsigned long long)sum); diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c index 33de567c25af..7981c4ffe787 100644 --- a/fs/proc/uptime.c +++ b/fs/proc/uptime.c @@ -5,23 +5,20 @@ #include <linux/seq_file.h> #include <linux/time.h> #include <linux/kernel_stat.h> -#include <linux/cputime.h>
static int uptime_proc_show(struct seq_file *m, void *v) { struct timespec uptime; struct timespec idle; - u64 idletime; u64 nsec; u32 rem; int i;
- idletime = 0; + nsec = 0; for_each_possible_cpu(i) - idletime += (__force u64) kcpustat_cpu(i).cpustat[CPUTIME_IDLE]; + nsec += (__force u64) kcpustat_cpu(i).cpustat[CPUTIME_IDLE];
get_monotonic_boottime(&uptime); - nsec = cputime64_to_jiffies64(idletime) * TICK_NSEC; idle.tv_sec = div_u64_rem(nsec, NSEC_PER_SEC, &rem); idle.tv_nsec = rem; seq_printf(m, "%lu.%02lu %lu.%02lu\n", diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index bc0b309c3f19..4c882791c10c 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -297,7 +297,7 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v) for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) { seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[stat], - cputime64_to_clock_t(val[stat])); + nsec_to_clock_t(val[stat])); }
return 0; diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 5ebee3164e64..c07f0cfea07e 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -75,9 +75,9 @@ static cputime_t irqtime_account_update(u64 irqtime, int idx, cputime_t maxtime) u64 *cpustat = kcpustat_this_cpu->cpustat; cputime_t irq_cputime;
- irq_cputime = nsecs_to_cputime64(irqtime) - cpustat[idx]; + irq_cputime = nsecs_to_cputime64(irqtime - cpustat[idx]); irq_cputime = min(irq_cputime, maxtime); - cpustat[idx] += irq_cputime; + cpustat[idx] += cputime_to_nsecs(irq_cputime);
return irq_cputime; } @@ -143,7 +143,7 @@ void account_user_time(struct task_struct *p, cputime_t cputime, index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
/* Add user time to cpustat. */ - task_group_account_field(p, index, (__force u64) cputime); + task_group_account_field(p, index, cputime_to_nsecs(cputime));
/* Account for user time used */ acct_account_cputime(p); @@ -168,11 +168,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
/* Add guest time to cpustat. */ if (task_nice(p) > 0) { - cpustat[CPUTIME_NICE] += (__force u64) cputime; - cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime; + cpustat[CPUTIME_NICE] += cputime_to_nsecs(cputime); + cpustat[CPUTIME_GUEST_NICE] += cputime_to_nsecs(cputime); } else { - cpustat[CPUTIME_USER] += (__force u64) cputime; - cpustat[CPUTIME_GUEST] += (__force u64) cputime; + cpustat[CPUTIME_USER] += cputime_to_nsecs(cputime); + cpustat[CPUTIME_GUEST] += cputime_to_nsecs(cputime); } }
@@ -193,7 +193,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime, account_group_system_time(p, cputime);
/* Add system time to cpustat. */ - task_group_account_field(p, index, (__force u64) cputime); + task_group_account_field(p, index, cputime_to_nsecs(cputime));
/* Account for system time used */ acct_account_cputime(p); @@ -234,7 +234,7 @@ void account_steal_time(cputime_t cputime) { u64 *cpustat = kcpustat_this_cpu->cpustat;
- cpustat[CPUTIME_STEAL] += (__force u64) cputime; + cpustat[CPUTIME_STEAL] += cputime_to_nsecs(cputime); }
/* @@ -247,9 +247,9 @@ void account_idle_time(cputime_t cputime) struct rq *rq = this_rq();
if (atomic_read(&rq->nr_iowait) > 0) - cpustat[CPUTIME_IOWAIT] += (__force u64) cputime; + cpustat[CPUTIME_IOWAIT] += cputime_to_nsecs(cputime); else - cpustat[CPUTIME_IDLE] += (__force u64) cputime; + cpustat[CPUTIME_IDLE] += cputime_to_nsecs(cputime); }
/*
On Thu, Sep 06, 2018 at 06:42:16PM -0700, Ivan Delalande wrote:
From: Frederic Weisbecker fweisbec@gmail.com
commit 7fb1327ee9b92fca27662f9b9d60c7c3376d6c69 upstream.
Kernel CPU stats are stored in cputime_t which is an architecture defined type, and hence a bit opaque and requiring accessors and mutators for any operation.
Converting them to nsecs simplifies the code and is one step toward the removal of cputime_t in the core code.
Signed-off-by: Frederic Weisbecker fweisbec@gmail.com Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Paul Mackerras paulus@samba.org Cc: Michael Ellerman mpe@ellerman.id.au Cc: Heiko Carstens heiko.carstens@de.ibm.com Cc: Martin Schwidefsky schwidefsky@de.ibm.com Cc: Tony Luck tony.luck@intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: Stanislaw Gruszka sgruszka@redhat.com Cc: Wanpeng Li wanpeng.li@hotmail.com Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.co... Signed-off-by: Ingo Molnar mingo@kernel.org [colona: minor conflict as 527b0a76f41d ("sched/cpuacct: Avoid %lld seq_printf warning") is missing from v4.9] Signed-off-by: Ivan Delalande colona@arista.com
arch/s390/appldata/appldata_os.c | 16 +++---- drivers/cpufreq/cpufreq.c | 6 +-- drivers/cpufreq/cpufreq_governor.c | 2 +- drivers/cpufreq/cpufreq_stats.c | 1 - drivers/macintosh/rack-meter.c | 2 +- fs/proc/stat.c | 68 +++++++++++++++--------------- fs/proc/uptime.c | 7 +-- kernel/sched/cpuacct.c | 2 +- kernel/sched/cputime.c | 22 +++++----- 9 files changed, 61 insertions(+), 65 deletions(-)
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c index 69b23b25ac34..08b9e942a262 100644 --- a/arch/s390/appldata/appldata_os.c +++ b/arch/s390/appldata/appldata_os.c @@ -113,21 +113,21 @@ static void appldata_get_os_data(void *data) j = 0; for_each_online_cpu(i) { os_data->os_cpu[j].per_cpu_user =
cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_USER]);
os_data->os_cpu[j].per_cpu_nice =nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_USER]);
cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_NICE]);
os_data->os_cpu[j].per_cpu_system =nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_NICE]);
cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]);
os_data->os_cpu[j].per_cpu_idle =nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM]);
cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IDLE]);
os_data->os_cpu[j].per_cpu_irq =nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IDLE]);
cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IRQ]);
os_data->os_cpu[j].per_cpu_softirq =nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IRQ]);
cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]);
os_data->os_cpu[j].per_cpu_iowait =nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ]);
cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IOWAIT]);
os_data->os_cpu[j].per_cpu_steal =nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_IOWAIT]);
cputime_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_STEAL]);
os_data->os_cpu[j].cpu_id = i; j++; }nsecs_to_jiffies(kcpustat_cpu(i).cpustat[CPUTIME_STEAL]);
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index af5eff6835a8..d6d91e8afa9e 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -132,7 +132,7 @@ static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall) u64 cur_wall_time; u64 busy_time;
- cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
- cur_wall_time = jiffies64_to_nsecs(get_jiffies_64());
busy_time = kcpustat_cpu(cpu).cpustat[CPUTIME_USER]; busy_time += kcpustat_cpu(cpu).cpustat[CPUTIME_SYSTEM]; @@ -143,9 +143,9 @@ static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall) idle_time = cur_wall_time - busy_time; if (wall)
*wall = cputime_to_usecs(cur_wall_time);
*wall = div_u64(cur_wall_time, NSEC_PER_USEC);
- return cputime_to_usecs(idle_time);
- return div_u64(idle_time, NSEC_PER_USEC);
} u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy) diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c index 642dd0f183a8..38d1a8216084 100644 --- a/drivers/cpufreq/cpufreq_governor.c +++ b/drivers/cpufreq/cpufreq_governor.c @@ -152,7 +152,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy) if (ignore_nice) { u64 cur_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
idle_time += cputime_to_usecs(cur_nice - j_cdbs->prev_cpu_nice);
}idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, NSEC_PER_USEC); j_cdbs->prev_cpu_nice = cur_nice;
diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c index 06d3abdffd3a..b084708fd113 100644 --- a/drivers/cpufreq/cpufreq_stats.c +++ b/drivers/cpufreq/cpufreq_stats.c @@ -13,7 +13,6 @@ #include <linux/cpufreq.h> #include <linux/module.h> #include <linux/slab.h> -#include <linux/cputime.h> static DEFINE_SPINLOCK(cpufreq_stats_lock); diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c index 25852e399ab2..13b16f34256c 100644 --- a/drivers/macintosh/rack-meter.c +++ b/drivers/macintosh/rack-meter.c @@ -91,7 +91,7 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu) if (rackmeter_ignore_nice) retval += kcpustat_cpu(cpu).cpustat[CPUTIME_NICE];
- return retval;
- return nsecs_to_cputime64(retval);
Turns out this patch breaks the build here on powerpc :(
And I can't easily figure out how to fix it up.
So I'm going to drop this series from the 4.9 tree right now. Can you fix this up and resend?
thanks,
greg k-h
From: Frederic Weisbecker fweisbec@gmail.com
commit a499a5a14dbd1d0315a96fc62a8798059325e9e6 upstream.
The irqtime is accounted is nsecs and stored in cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the accumulated amount reaches a new jiffy, this one gets accounted to the kcpustat.
This was necessary when kcpustat was stored in cputime_t, which could at worst have jiffies granularity. But now kcpustat is stored in nsecs so this whole discretization game with temporary irqtime storage has become unnecessary.
We can now directly account the irqtime to the kcpustat.
Signed-off-by: Frederic Weisbecker fweisbec@gmail.com Cc: Benjamin Herrenschmidt benh@kernel.crashing.org Cc: Fenghua Yu fenghua.yu@intel.com Cc: Heiko Carstens heiko.carstens@de.ibm.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Martin Schwidefsky schwidefsky@de.ibm.com Cc: Michael Ellerman mpe@ellerman.id.au Cc: Paul Mackerras paulus@samba.org Cc: Peter Zijlstra peterz@infradead.org Cc: Rik van Riel riel@redhat.com Cc: Stanislaw Gruszka sgruszka@redhat.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Tony Luck tony.luck@intel.com Cc: Wanpeng Li wanpeng.li@hotmail.com Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.c... Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Ivan Delalande colona@arista.com --- kernel/sched/cputime.c | 50 ++++++++++++++---------------------------- kernel/sched/sched.h | 7 +++--- 2 files changed, 21 insertions(+), 36 deletions(-)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index c07f0cfea07e..2c2060c99896 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -44,6 +44,7 @@ void disable_sched_clock_irqtime(void) void irqtime_account_irq(struct task_struct *curr) { struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime); + u64 *cpustat = kcpustat_this_cpu->cpustat; s64 delta; int cpu;
@@ -61,49 +62,35 @@ void irqtime_account_irq(struct task_struct *curr) * in that case, so as not to confuse scheduler with a special task * that do not consume any time, but still wants to run. */ - if (hardirq_count()) - irqtime->hardirq_time += delta; - else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) - irqtime->softirq_time += delta; + if (hardirq_count()) { + cpustat[CPUTIME_IRQ] += delta; + irqtime->tick_delta += delta; + } else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) { + cpustat[CPUTIME_SOFTIRQ] += delta; + irqtime->tick_delta += delta; + }
u64_stats_update_end(&irqtime->sync); } EXPORT_SYMBOL_GPL(irqtime_account_irq);
-static cputime_t irqtime_account_update(u64 irqtime, int idx, cputime_t maxtime) +static cputime_t irqtime_tick_accounted(cputime_t maxtime) { - u64 *cpustat = kcpustat_this_cpu->cpustat; - cputime_t irq_cputime; - - irq_cputime = nsecs_to_cputime64(irqtime - cpustat[idx]); - irq_cputime = min(irq_cputime, maxtime); - cpustat[idx] += cputime_to_nsecs(irq_cputime); - - return irq_cputime; -} + struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime); + cputime_t delta;
-static cputime_t irqtime_account_hi_update(cputime_t maxtime) -{ - return irqtime_account_update(__this_cpu_read(cpu_irqtime.hardirq_time), - CPUTIME_IRQ, maxtime); -} + delta = nsecs_to_cputime(irqtime->tick_delta); + delta = min(delta, maxtime); + irqtime->tick_delta -= cputime_to_nsecs(delta);
-static cputime_t irqtime_account_si_update(cputime_t maxtime) -{ - return irqtime_account_update(__this_cpu_read(cpu_irqtime.softirq_time), - CPUTIME_SOFTIRQ, maxtime); + return delta; }
#else /* CONFIG_IRQ_TIME_ACCOUNTING */
#define sched_clock_irqtime (0)
-static cputime_t irqtime_account_hi_update(cputime_t dummy) -{ - return 0; -} - -static cputime_t irqtime_account_si_update(cputime_t dummy) +static cputime_t irqtime_tick_accounted(cputime_t dummy) { return 0; } @@ -290,10 +277,7 @@ static inline cputime_t account_other_time(cputime_t max) accounted = steal_account_process_time(max);
if (accounted < max) - accounted += irqtime_account_hi_update(max - accounted); - - if (accounted < max) - accounted += irqtime_account_si_update(max - accounted); + accounted += irqtime_tick_accounted(max - accounted);
return accounted; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f564a1d2c9d5..187c556196af 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -4,6 +4,7 @@ #include <linux/sched/rt.h> #include <linux/u64_stats_sync.h> #include <linux/sched/deadline.h> +#include <linux/kernel_stat.h> #include <linux/binfmts.h> #include <linux/mutex.h> #include <linux/spinlock.h> @@ -1742,8 +1743,7 @@ static inline void nohz_balance_exit_idle(unsigned int cpu) { }
#ifdef CONFIG_IRQ_TIME_ACCOUNTING struct irqtime { - u64 hardirq_time; - u64 softirq_time; + u64 tick_delta; u64 irq_start_time; struct u64_stats_sync sync; }; @@ -1753,12 +1753,13 @@ DECLARE_PER_CPU(struct irqtime, cpu_irqtime); static inline u64 irq_time_read(int cpu) { struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu); + u64 *cpustat = kcpustat_cpu(cpu).cpustat; unsigned int seq; u64 total;
do { seq = __u64_stats_fetch_begin(&irqtime->sync); - total = irqtime->softirq_time + irqtime->hardirq_time; + total = cpustat[CPUTIME_SOFTIRQ] + cpustat[CPUTIME_IRQ]; } while (__u64_stats_fetch_retry(&irqtime->sync, seq));
return total;
From: Frederic Weisbecker fweisbec@gmail.com
commit 25e2d8c1b9e327ed260edd13169cc22bc7a78bc6 upstream.
irq_time_read() returns the irqtime minus the ksoftirqd time. This is necessary because irq_time_read() is used to substract the IRQ time from the sum_exec_runtime of a task. If we were to include the softirq time of ksoftirqd, this task would substract its own CPU time everytime it updates ksoftirqd->sum_exec_runtime which would therefore never progress.
But this behaviour got broken by:
a499a5a14db ("sched/cputime: Increment kcpustat directly on irqtime account")
... which now includes ksoftirqd softirq time in the time returned by irq_time_read().
This has resulted in wrong ksoftirqd cputime reported to userspace through /proc/stat and thus "top" not showing ksoftirqd when it should after intense networking load.
ksoftirqd->stime happens to be correct but it gets scaled down by sum_exec_runtime through task_cputime_adjusted().
To fix this, just account the strict IRQ time in a separate counter and use it to report the IRQ time.
Reported-and-tested-by: Jesper Dangaard Brouer brouer@redhat.com Signed-off-by: Frederic Weisbecker fweisbec@gmail.com Reviewed-by: Rik van Riel riel@redhat.com Acked-by: Jesper Dangaard Brouer brouer@redhat.com Cc: Linus Torvalds torvalds@linux-foundation.org Cc: Peter Zijlstra peterz@infradead.org Cc: Stanislaw Gruszka sgruszka@redhat.com Cc: Thomas Gleixner tglx@linutronix.de Cc: Wanpeng Li wanpeng.li@hotmail.com Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar mingo@kernel.org Signed-off-by: Ivan Delalande colona@arista.com --- kernel/sched/cputime.c | 27 ++++++++++++++++----------- kernel/sched/sched.h | 9 +++++++-- 2 files changed, 23 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 2c2060c99896..448d6426fa5f 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -37,6 +37,18 @@ void disable_sched_clock_irqtime(void) sched_clock_irqtime = 0; }
+static void irqtime_account_delta(struct irqtime *irqtime, u64 delta, + enum cpu_usage_stat idx) +{ + u64 *cpustat = kcpustat_this_cpu->cpustat; + + u64_stats_update_begin(&irqtime->sync); + cpustat[idx] += delta; + irqtime->total += delta; + irqtime->tick_delta += delta; + u64_stats_update_end(&irqtime->sync); +} + /* * Called before incrementing preempt_count on {soft,}irq_enter * and before decrementing preempt_count on {soft,}irq_exit. @@ -44,7 +56,6 @@ void disable_sched_clock_irqtime(void) void irqtime_account_irq(struct task_struct *curr) { struct irqtime *irqtime = this_cpu_ptr(&cpu_irqtime); - u64 *cpustat = kcpustat_this_cpu->cpustat; s64 delta; int cpu;
@@ -55,22 +66,16 @@ void irqtime_account_irq(struct task_struct *curr) delta = sched_clock_cpu(cpu) - irqtime->irq_start_time; irqtime->irq_start_time += delta;
- u64_stats_update_begin(&irqtime->sync); /* * We do not account for softirq time from ksoftirqd here. * We want to continue accounting softirq time to ksoftirqd thread * in that case, so as not to confuse scheduler with a special task * that do not consume any time, but still wants to run. */ - if (hardirq_count()) { - cpustat[CPUTIME_IRQ] += delta; - irqtime->tick_delta += delta; - } else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) { - cpustat[CPUTIME_SOFTIRQ] += delta; - irqtime->tick_delta += delta; - } - - u64_stats_update_end(&irqtime->sync); + if (hardirq_count()) + irqtime_account_delta(irqtime, delta, CPUTIME_IRQ); + else if (in_serving_softirq() && curr != this_cpu_ksoftirqd()) + irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ); } EXPORT_SYMBOL_GPL(irqtime_account_irq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 187c556196af..923cc35e8490 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1743,6 +1743,7 @@ static inline void nohz_balance_exit_idle(unsigned int cpu) { }
#ifdef CONFIG_IRQ_TIME_ACCOUNTING struct irqtime { + u64 total; u64 tick_delta; u64 irq_start_time; struct u64_stats_sync sync; @@ -1750,16 +1751,20 @@ struct irqtime {
DECLARE_PER_CPU(struct irqtime, cpu_irqtime);
+/* + * Returns the irqtime minus the softirq time computed by ksoftirqd. + * Otherwise ksoftirqd's sum_exec_runtime is substracted its own runtime + * and never move forward. + */ static inline u64 irq_time_read(int cpu) { struct irqtime *irqtime = &per_cpu(cpu_irqtime, cpu); - u64 *cpustat = kcpustat_cpu(cpu).cpustat; unsigned int seq; u64 total;
do { seq = __u64_stats_fetch_begin(&irqtime->sync); - total = cpustat[CPUTIME_SOFTIRQ] + cpustat[CPUTIME_IRQ]; + total = irqtime->total; } while (__u64_stats_fetch_retry(&irqtime->sync, seq));
return total;
Hey Greg,
Did you get a chance to look at this series? Thanks a lot!
On Thu, Sep 06, 2018 at 06:42:14PM -0700, Ivan Delalande wrote:
This series fixes issues we've seen with softirq time accounting in 4.9:
- when ksoftirqd is running at 100% on a CPU, none of the values reported by /proc/stat for that CPU will change, sometimes for dozens of seconds,
- large deviations in the total number of ticks accumulated over a fixed time for a CPU, probably because of the first issue hitting for shorter periods.
We found out that something pretty similar had been reported 9 months ago, see the reference link below. In that discussion, Rabin Vincent had made a 4.9 specific patch which fixes our first issue, but we were still seeing some deviation from the total number of ticks (up to 1.7% from expected, where we had only 0.2% on older kernels), and you had also asked for a direct backport from the mainline series, if possible.
As mentioned in that thread, a lot of changes (probably 50+) went into 4.11 to remove cputime, but we could get something working with only the 4 attached patches to fix these two issues. Three of these patches apply without change, and the second one in the series ("sched/cputime: Convert kcpustat to nsecs") needed a minor change as a cast had been added in 527b0a76f41d ("sched/cpuacct: Avoid %lld seq_printf warning") to fix a build warning on s390. I guess we could also include that patch in this series, let me know if this is the preferred way to handle this.
We ran our tests on 3.18, 4.4 and 4.9 and confirmed that only 4.9 would need this series, and that this series indeed restores the behavior we were seeing on those older kernels.
Thanks!
Reference: http://lkml.kernel.org/r/%3C1513159876-5125-1-git-send-email-rabin.vincent@a...
Frederic Weisbecker (4): time: Introduce jiffies64_to_nsecs() sched/cputime: Convert kcpustat to nsecs sched/cputime: Increment kcpustat directly on irqtime account sched/cputime: Fix ksoftirqd cputime accounting regression
arch/s390/appldata/appldata_os.c | 16 +++---- drivers/cpufreq/cpufreq.c | 6 +-- drivers/cpufreq/cpufreq_governor.c | 2 +- drivers/cpufreq/cpufreq_stats.c | 1 - drivers/macintosh/rack-meter.c | 2 +- fs/proc/stat.c | 68 +++++++++++++-------------- fs/proc/uptime.c | 7 +-- include/linux/jiffies.h | 2 + kernel/sched/cpuacct.c | 2 +- kernel/sched/cputime.c | 75 +++++++++++++----------------- kernel/sched/sched.h | 12 +++-- kernel/time/time.c | 10 ++++ kernel/time/timeconst.bc | 6 +++ 13 files changed, 109 insertions(+), 100 deletions(-)
On Thu, Sep 13, 2018 at 04:13:17PM -0700, Ivan Delalande wrote:
Hey Greg,
Did you get a chance to look at this series? Thanks a lot!
Not yet, sorry, I was digging myself out from under the patch backlog of all of the other stuff sent before this series.
It isn't lost, don't worry, give it a chance. It's also almost a 2 year old kernel you are using here, so there can't be a lot of rush :)
thanks,
greg k-h
On Fri, Sep 14, 2018 at 08:57:48AM +0200, Greg Kroah-Hartman wrote:
On Thu, Sep 13, 2018 at 04:13:17PM -0700, Ivan Delalande wrote:
Hey Greg,
Did you get a chance to look at this series? Thanks a lot!
Not yet, sorry, I was digging myself out from under the patch backlog of all of the other stuff sent before this series.
It isn't lost, don't worry, give it a chance. It's also almost a 2 year old kernel you are using here, so there can't be a lot of rush :)
All now queued up, thanks.
greg k-h
linux-stable-mirror@lists.linaro.org