While migrating to the common clock framework (CCF), we found that the
FIMD clocks were being gated by the CCF.
If the CCF finds clocks that have not been claimed and enabled by any
driver, it disables them as unused.
Calling clk_prepare_enable() for the FIMD clocks fixes the issue.
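Note that clk_prepare_enable() can fail; the patch below adds the calls
unchecked. An error-checked variant would look like this (a sketch only,
not part of the patch; 'ret' and 'ctx' as already used in fimd_probe()):

	ret = clk_prepare_enable(ctx->lcd_clk);
	if (ret < 0)
		return ret;

	ret = clk_prepare_enable(ctx->bus_clk);
	if (ret < 0) {
		/* undo the first enable before bailing out */
		clk_disable_unprepare(ctx->lcd_clk);
		return ret;
	}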
Signed-off-by: Vikas Sajjan <vikas.sajjan(a)linaro.org>
---
drivers/gpu/drm/exynos/exynos_drm_fimd.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/exynos/exynos_drm_fimd.c b/drivers/gpu/drm/exynos/exynos_drm_fimd.c
index 9537761..d93dd8a 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_fimd.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_fimd.c
@@ -934,6 +934,9 @@ static int fimd_probe(struct platform_device *pdev)
return ret;
}
+ clk_prepare_enable(ctx->lcd_clk);
+ clk_prepare_enable(ctx->bus_clk);
+
ctx->vidcon0 = pdata->vidcon0;
ctx->vidcon1 = pdata->vidcon1;
ctx->default_win = pdata->default_win;
--
1.7.9.5
When a cpu goes into a deep idle state where its local timer is shut down,
it notifies the time framework to use the broadcast timer instead.
Unfortunately, the broadcast device can wake up any CPU, including an idle
one which is not concerned by the wake up at all.
This implies that, in the worst case, an idle CPU will wake up just to send
an IPI to another idle cpu.
This patchset solves the problem by setting the irq affinity of the
broadcast timer to the CPU concerned by the nearest timer event. This way,
the CPU that wakes up is guaranteed to be the one concerned by the next
event, and no other idle CPU is woken up unnecessarily.
As irq affinity is not supported by all architectures, a flag is needed to
mark the clock event devices that can handle it: CLOCK_EVT_FEAT_DYNIRQ.
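The core of the mechanism, as a simplified sketch (the real code lives in
kernel/time/tick-broadcast.c in the patches below):

	/*
	 * Sketch: route the broadcast timer irq to the CPU handling the
	 * next event, but only when the device supports dynamic affinity.
	 */
	static void tick_broadcast_set_affinity(struct clock_event_device *bc,
						const struct cpumask *cpumask)
	{
		if (!(bc->features & CLOCK_EVT_FEAT_DYNIRQ))
			return;

		if (cpumask_equal(bc->cpumask, cpumask))
			return;

		bc->cpumask = cpumask;
		irq_set_affinity(bc->irq, bc->cpumask);
	}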
Tested on a u8500 board with a test program running 'usleep 10000'
indefinitely, pinned on each CPU.
With dynamic irq affinity:
Log is 10.042298 secs long with 4190 events
cpu0/state0, 24 hits, total 2718.00us, avg 113.25us, min 0.00us, max 854.00us
cpu0/state1, 994 hits, total 9874827.00us, avg 9934.43us, min 30.00us, max 10346.00us
cpu1/state0, 73 hits, total 17001.00us, avg 232.89us, min 0.00us, max 10040.00us
cpu1/state1, 1002 hits, total 9883840.00us, avg 9864.11us, min 0.00us, max 10742.00us
cluster/state0, 0 hits, total 0.00us, avg 0.00us, min 0.00us, max 0.00us
cluster/state1, 1931 hits, total 9762328.00us, avg 5055.58us, min 30.00us, max 9308.00us
Without dynamic irq affinity:
Log is 10.036834 secs long with 6574 events
cpu0/state0, 114 hits, total 20107.00us, avg 176.38us, min 0.00us, max 7233.00us
cpu0/state1, 1951 hits, total 9833836.00us, avg 5040.41us, min 0.00us, max 9217.00us
cpu1/state0, 223 hits, total 21140.00us, avg 94.80us, min 0.00us, max 2960.00us
cpu1/state1, 997 hits, total 9879748.00us, avg 9909.48us, min 0.00us, max 10346.00us
cluster/state0, 5 hits, total 5462.00us, avg 1092.40us, min 580.00us, max 2899.00us
cluster/state1, 2298 hits, total 9740988.00us, avg 4238.90us, min 30.00us, max 9217.00us
Results for the specific test case 'usleep 10000':
* reduced the number of wake ups on the system by 40%
* reduced the number of wake ups for CPU0 by 49%
* increased the idle time for CPU0 by a factor of two
* increased package idle hits by 16% and average package idle time by 16%
Changelog:
V2:
* mentioned the CLOCK_EVT_FEAT_DYNIRQ flag name in the patch description
* added comments for CLOCK_EVT_FEAT_DYNIRQ
* replaced the tick_broadcast_set_affinity parameter to use a cpumask
V1: initial post
Daniel Lezcano (3):
time : pass broadcast parameter
time : set broadcast irq affinity
ARM: nomadik: add dynamic irq flag to the timer
Viresh Kumar (1):
ARM: timer-sp: Set dynamic irq affinity
arch/arm/common/timer-sp.c | 3 ++-
drivers/clocksource/nomadik-mtu.c | 3 ++-
include/linux/clockchips.h | 5 +++++
kernel/time/tick-broadcast.c | 41 +++++++++++++++++++++++++++++--------
4 files changed, 42 insertions(+), 10 deletions(-)
--
1.7.9.5
On my SMP platform, which is made of 5 cores in 2 clusters, the
nr_busy_cpus field of the sched_group_power struct is not null when the
platform is fully idle. The root cause is:
during the boot sequence, some CPUs reach the idle loop and set their
NOHZ_IDLE flag while waiting for the other CPUs to boot. But the
nr_busy_cpus field is initialized later, with the assumption that all CPUs
are in the busy state, whereas some CPUs have already set their NOHZ_IDLE
flag.
More generally, the NOHZ_IDLE flag must be initialized when new
sched_domains are created, in order to ensure that NOHZ_IDLE and
nr_busy_cpus stay aligned.
This condition could be ensured by adding a synchronize_rcu() between the
destruction of the old sched_domains and the creation of the new ones, so
that the NOHZ_IDLE flag would not be updated against an old sched_domain
once it has been initialized. But this solution introduces additional
latency in the rebuild sequence that is called during cpu hotplug.
As suggested by Frederic Weisbecker, another solution is to have the same
RCU lifecycle for both the NOHZ_IDLE flag and the sched_domain struct. I
have introduced a new sched_domain_rq struct that is the entry point for
both the sched_domains and the objects that must follow the same
lifecycle, such as the NOHZ_IDLE flag. They share the same RCU lifecycle
and are therefore always synchronized.
The synchronization is done at the cost of:
- an additional indirection for accessing the first sched_domain level
- an additional indirection and an rcu_dereference before accessing the
  NOHZ_IDLE flag.
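The underlying idiom is the classic RCU publish/reclaim pattern; a generic
sketch (names here are illustrative, not the scheduler's):

	/* Group the objects in one RCU-protected container so that they
	 * are published and freed atomically. */
	struct container {
		struct payload *obj;	/* e.g. the sched_domain pointer */
		unsigned long flags;	/* e.g. the NOHZ_IDLE bit */
		struct rcu_head rcu;	/* for deferred freeing */
	};

	/* Reader: one rcu_dereference() gives a consistent view of both fields. */
	rcu_read_lock();
	c = rcu_dereference(shared);
	if (c && test_bit(MY_FLAG, &c->flags))
		use(c->obj);
	rcu_read_unlock();

	/* Updater: publish the new container, free the old one after a grace period. */
	old = shared;
	rcu_assign_pointer(shared, new_c);
	kfree_rcu(old, rcu);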
Changes since V4:
- link both the sched_domain and the NOHZ_IDLE flag in one RCU object so
  that their states are always synchronized.
Changes since V3:
- the NOHZ flag is not cleared if a NULL domain is attached to the CPU
- remove patch 2/2, which becomes useless with the latest modifications
Changes since V2:
- change the initialization to the idle state instead of the busy state so
  that a CPU entering idle during the build of the sched_domain will not
  corrupt the initialization state
Changes since V1:
- remove the patch for the SCHED softirq on an idle core use case, as it
  was a side effect of the other use cases.
Signed-off-by: Vincent Guittot <vincent.guittot(a)linaro.org>
---
include/linux/sched.h | 6 +++
kernel/sched/core.c | 105 ++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/fair.c | 35 +++++++++++------
kernel/sched/sched.h | 24 +++++++++--
4 files changed, 145 insertions(+), 25 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..2a52188 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -959,6 +959,12 @@ struct sched_domain {
unsigned long span[0];
};
+struct sched_domain_rq {
+ struct sched_domain *sd;
+ unsigned long flags;
+ struct rcu_head rcu; /* used during destruction */
+};
+
static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
{
return to_cpumask(sd->span);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..69e2313 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5602,6 +5602,15 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
destroy_sched_domain(sd, cpu);
}
+static void destroy_sched_domain_rq(struct sched_domain_rq *sd_rq, int cpu)
+{
+ if (!sd_rq)
+ return;
+
+ destroy_sched_domains(sd_rq->sd, cpu);
+ kfree_rcu(sd_rq, rcu);
+}
+
/*
* Keep a special pointer to the highest sched_domain that has
* SD_SHARE_PKG_RESOURCE set (Last Level Cache Domain) for this
@@ -5632,10 +5641,23 @@ static void update_top_cache_domain(int cpu)
* hold the hotplug lock.
*/
static void
-cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
+cpu_attach_domain(struct sched_domain_rq *sd_rq, struct root_domain *rd,
+ int cpu)
{
struct rq *rq = cpu_rq(cpu);
- struct sched_domain *tmp;
+ struct sched_domain_rq *tmp_rq;
+ struct sched_domain *tmp, *sd = NULL;
+
+ /*
+ * If we don't have any sched_domain and associated object, we can
+ * directly jump to the attach sequence otherwise we try to degenerate
+ * the sched_domain
+ */
+ if (!sd_rq)
+ goto attach;
+
+ /* Get a pointer to the 1st sched_domain */
+ sd = sd_rq->sd;
/* Remove the sched domains which do not contribute to scheduling. */
for (tmp = sd; tmp; ) {
@@ -5658,14 +5680,17 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
destroy_sched_domain(tmp, cpu);
if (sd)
sd->child = NULL;
+ /* update sched_domain_rq */
+ sd_rq->sd = sd;
}
+attach:
sched_domain_debug(sd, cpu);
rq_attach_root(rq, rd);
- tmp = rq->sd;
- rcu_assign_pointer(rq->sd, sd);
- destroy_sched_domains(tmp, cpu);
+ tmp_rq = rq->sd_rq;
+ rcu_assign_pointer(rq->sd_rq, sd_rq);
+ destroy_sched_domain_rq(tmp_rq, cpu);
update_top_cache_domain(cpu);
}
@@ -5695,12 +5720,14 @@ struct sd_data {
};
struct s_data {
+ struct sched_domain_rq ** __percpu sd_rq;
struct sched_domain ** __percpu sd;
struct root_domain *rd;
};
enum s_alloc {
sa_rootdomain,
+ sa_sd_rq,
sa_sd,
sa_sd_storage,
sa_none,
@@ -5935,7 +5962,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
return;
update_group_power(sd, cpu);
- atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
+ atomic_set(&sg->sgp->nr_busy_cpus, 0);
}
int __weak arch_sd_sibling_asym_packing(void)
@@ -6011,6 +6038,8 @@ static void set_domain_attribute(struct sched_domain *sd,
static void __sdt_free(const struct cpumask *cpu_map);
static int __sdt_alloc(const struct cpumask *cpu_map);
+static void __sdrq_free(const struct cpumask *cpu_map, struct s_data *d);
+static int __sdrq_alloc(const struct cpumask *cpu_map, struct s_data *d);
static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
const struct cpumask *cpu_map)
@@ -6019,6 +6048,9 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
case sa_rootdomain:
if (!atomic_read(&d->rd->refcount))
free_rootdomain(&d->rd->rcu); /* fall through */
+ case sa_sd_rq:
+ __sdrq_free(cpu_map, d); /* fall through */
+ free_percpu(d->sd_rq); /* fall through */
case sa_sd:
free_percpu(d->sd); /* fall through */
case sa_sd_storage:
@@ -6038,9 +6070,14 @@ static enum s_alloc __visit_domain_allocation_hell(struct s_data *d,
d->sd = alloc_percpu(struct sched_domain *);
if (!d->sd)
return sa_sd_storage;
+ d->sd_rq = alloc_percpu(struct sched_domain_rq *);
+ if (!d->sd_rq)
+ return sa_sd;
+ if (__sdrq_alloc(cpu_map, d))
+ return sa_sd_rq;
d->rd = alloc_rootdomain();
if (!d->rd)
- return sa_sd;
+ return sa_sd_rq;
return sa_rootdomain;
}
@@ -6466,6 +6503,46 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
}
+static int __sdrq_alloc(const struct cpumask *cpu_map, struct s_data *d)
+{
+ int j;
+
+ for_each_cpu(j, cpu_map) {
+ struct sched_domain_rq *sd_rq;
+
+ sd_rq = kzalloc_node(sizeof(struct sched_domain_rq),
+ GFP_KERNEL, cpu_to_node(j));
+ if (!sd_rq)
+ return -ENOMEM;
+
+ *per_cpu_ptr(d->sd_rq, j) = sd_rq;
+ }
+
+ return 0;
+}
+
+static void __sdrq_free(const struct cpumask *cpu_map, struct s_data *d)
+{
+ int j;
+
+ for_each_cpu(j, cpu_map)
+ if (*per_cpu_ptr(d->sd_rq, j))
+ kfree(*per_cpu_ptr(d->sd_rq, j));
+}
+
+static void build_sched_domain_rq(struct s_data *d, int cpu)
+{
+ struct sched_domain_rq *sd_rq;
+ struct sched_domain *sd;
+
+ /* Attach sched_domain to sched_domain_rq */
+ sd = *per_cpu_ptr(d->sd, cpu);
+ sd_rq = *per_cpu_ptr(d->sd_rq, cpu);
+ sd_rq->sd = sd;
+ /* Init flags */
+ set_bit(NOHZ_IDLE, sched_rq_flags(sd_rq));
+}
+
struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
struct s_data *d, const struct cpumask *cpu_map,
struct sched_domain_attr *attr, struct sched_domain *child,
@@ -6495,6 +6572,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
struct sched_domain_attr *attr)
{
enum s_alloc alloc_state = sa_none;
+ struct sched_domain_rq *sd_rq;
struct sched_domain *sd;
struct s_data d;
int i, ret = -ENOMEM;
@@ -6547,11 +6625,18 @@ static int build_sched_domains(const struct cpumask *cpu_map,
}
}
+ /* Init objects that must follow the sched_domain lifecycle */
+ for_each_cpu(i, cpu_map) {
+ build_sched_domain_rq(&d, i);
+ }
+
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
- sd = *per_cpu_ptr(d.sd, i);
- cpu_attach_domain(sd, d.rd, i);
+ sd_rq = *per_cpu_ptr(d.sd_rq, i);
+ cpu_attach_domain(sd_rq, d.rd, i);
+ /* claim allocation of sched_domain_rq object */
+ *per_cpu_ptr(d.sd_rq, i) = NULL;
}
rcu_read_unlock();
@@ -6982,7 +7067,7 @@ void __init sched_init(void)
rq->last_load_update_tick = jiffies;
#ifdef CONFIG_SMP
- rq->sd = NULL;
+ rq->sd_rq = NULL;
rq->rd = NULL;
rq->cpu_power = SCHED_POWER_SCALE;
rq->post_schedule = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..1c7447e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5392,31 +5392,39 @@ static inline void nohz_balance_exit_idle(int cpu)
static inline void set_cpu_sd_state_busy(void)
{
+ struct sched_domain_rq *sd_rq;
struct sched_domain *sd;
int cpu = smp_processor_id();
- if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
- return;
- clear_bit(NOHZ_IDLE, nohz_flags(cpu));
-
rcu_read_lock();
- for_each_domain(cpu, sd)
+ sd_rq = get_sched_domain_rq(cpu);
+
+ if (!sd_rq || !test_bit(NOHZ_IDLE, sched_rq_flags(sd_rq)))
+ goto unlock;
+ clear_bit(NOHZ_IDLE, sched_rq_flags(sd_rq));
+
+ for_each_domain_from_rq(sd_rq, sd)
atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+unlock:
rcu_read_unlock();
}
void set_cpu_sd_state_idle(void)
{
+ struct sched_domain_rq *sd_rq;
struct sched_domain *sd;
int cpu = smp_processor_id();
- if (test_bit(NOHZ_IDLE, nohz_flags(cpu)))
- return;
- set_bit(NOHZ_IDLE, nohz_flags(cpu));
-
rcu_read_lock();
- for_each_domain(cpu, sd)
+ sd_rq = get_sched_domain_rq(cpu);
+
+ if (!sd_rq || test_bit(NOHZ_IDLE, sched_rq_flags(sd_rq)))
+ goto unlock;
+ set_bit(NOHZ_IDLE, sched_rq_flags(sd_rq));
+
+ for_each_domain_from_rq(sd_rq, sd)
atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+unlock:
rcu_read_unlock();
}
@@ -5673,7 +5681,12 @@ static void run_rebalance_domains(struct softirq_action *h)
static inline int on_null_domain(int cpu)
{
- return !rcu_dereference_sched(cpu_rq(cpu)->sd);
+ struct sched_domain_rq *sd_rq =
+ rcu_dereference_sched(cpu_rq(cpu)->sd_rq);
+ struct sched_domain *sd = NULL;
+ if (sd_rq)
+ sd = sd_rq->sd;
+ return !sd;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..f589306 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -417,7 +417,7 @@ struct rq {
#ifdef CONFIG_SMP
struct root_domain *rd;
- struct sched_domain *sd;
+ struct sched_domain_rq *sd_rq;
unsigned long cpu_power;
@@ -505,21 +505,37 @@ DECLARE_PER_CPU(struct rq, runqueues);
#ifdef CONFIG_SMP
-#define rcu_dereference_check_sched_domain(p) \
+#define rcu_dereference_check_sched_domain_rq(p) \
rcu_dereference_check((p), \
lockdep_is_held(&sched_domains_mutex))
+#define get_sched_domain_rq(cpu) \
+ rcu_dereference_check_sched_domain_rq(cpu_rq(cpu)->sd_rq)
+
+#define rcu_dereference_check_sched_domain(cpu) ({ \
+ struct sched_domain_rq *__sd_rq = get_sched_domain_rq(cpu); \
+ struct sched_domain *__sd = NULL; \
+ if (__sd_rq) \
+ __sd = __sd_rq->sd; \
+ __sd; \
+})
+
+#define sched_rq_flags(sd_rq) (&sd_rq->flags)
+
/*
- * The domain tree (rq->sd) is protected by RCU's quiescent state transition.
+ * The domain tree (rq->sd_rq) is protected by RCU's quiescent state transition.
* See detach_destroy_domains: synchronize_sched for details.
*
* The domain tree of any CPU may only be accessed from within
* preempt-disabled sections.
*/
#define for_each_domain(cpu, __sd) \
- for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
+ for (__sd = rcu_dereference_check_sched_domain(cpu); \
__sd; __sd = __sd->parent)
+#define for_each_domain_from_rq(sd_rq, __sd) \
+ for (__sd = sd_rq->sd; __sd; __sd = __sd->parent)
+
#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
/**
--
1.7.9.5
Guenter,
Please check these v4 patches; thanks for all your reviews and comments on
this patch set.
Anton Vorontsov,
I have added your Acked-by to patch [1/3]; but for this v4 [2/3], we had
to change it substantially according to Guenter's feedback and our
internal discussions, so please have a look at this new [2/3] again,
thank you.
v3 -> v4 changes:
for patch [3/3]
- define delays in HZ
- update the ab8500_read_sensor function, returning the temperature via a
  parameter
- remove the ab8500_is_visible function
- use clamp_val in the set_min and set_max callbacks (see the sketch below
  the changelog)
- remove unnecessary locks in the remove and suspend functions
- let abx500 and ab8500 each use their own data structure
for patch [2/3]
- move the data tables from drivers/power/ab8500_bmdata.c to
  include/linux/power/ab8500.h
- rename drivers/power/ab8500_bmdata.c to drivers/power/ab8500_bm.c
- rename these variables to eliminate CamelCase warnings
- add the const attribute to these data tables
v2 -> v3 changes:
- Add an interface for converting voltage to temperature
- Remove the temp5 sensor since we cannot offer a temperature read
  interface for it
- Update hyst to use an absolute temperature instead of a difference
- Add the 3/3 patch
v1 -> v2 changes:
- Add Documentation/hwmon/ab8500 and Documentation/hwmon/abx500
- Make devices which cannot report milli-Celsius invisible
- Add the temp5_crit interface
- Re-work the old find_active_thresholds() into threshold_updated()
- Reset updated_min_alarm and updated_max_alarm at the end of each loop
- Update the hyst mechanism to make it work as a real hysteresis
- Remove non-standard attributes
- Re-order the operation sequence inside the probe and remove functions
- Update all the lock usages to eliminate race conditions
- Make attribute indexes start from 0
also changes:
- Since the old [1/2] "ARM: ux500: rename ab8500 to abx500 for hwmon
  driver" has been merged by Samuel, it won't be sent again.
- Add another new patch "ab8500_btemp: export two symbols" as [2/2] of
  this patch set.
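For reference, the clamp_val() change called out in the v4 notes follows
the usual hwmon store-callback pattern; a hypothetical sketch (the bounds
and the elided driver plumbing are made up, not the actual driver code):

	static ssize_t set_min(struct device *dev, struct device_attribute *attr,
			       const char *buf, size_t count)
	{
		long val;
		int res = kstrtol(buf, 10, &val);

		if (res < 0)
			return res;

		/* Clamp user input to the sensor's supported range (example bounds). */
		val = clamp_val(val, 0, 150);

		/* ...store val in the driver data, under the driver's lock... */
		return count;
	}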
Hongbo Zhang (3):
ab8500_btemp: make ab8500_btemp_get* interfaces public
ab8500: re-arrange ab8500 power and temperature data tables
hwmon: add ST-Ericsson ABX500 hwmon driver
Documentation/hwmon/ab8500 | 22 ++
Documentation/hwmon/abx500 | 28 ++
drivers/hwmon/Kconfig | 13 +
drivers/hwmon/Makefile | 1 +
drivers/hwmon/ab8500.c | 208 ++++++++++++++
drivers/hwmon/abx500.c | 494 +++++++++++++++++++++++++++++++++
drivers/hwmon/abx500.h | 69 +++++
drivers/power/Makefile | 2 +-
drivers/power/ab8500_bm.c | 341 +++++++++++++++++++++++
drivers/power/ab8500_bmdata.c | 519 -----------------------------------
drivers/power/ab8500_btemp.c | 5 +-
drivers/power/ab8500_fg.c | 4 +-
include/linux/mfd/abx500.h | 6 +-
include/linux/mfd/abx500/ab8500-bm.h | 5 +
include/linux/power/ab8500.h | 189 +++++++++++++
15 files changed, 1380 insertions(+), 526 deletions(-)
create mode 100644 Documentation/hwmon/ab8500
create mode 100644 Documentation/hwmon/abx500
create mode 100644 drivers/hwmon/ab8500.c
create mode 100644 drivers/hwmon/abx500.c
create mode 100644 drivers/hwmon/abx500.h
create mode 100644 drivers/power/ab8500_bm.c
delete mode 100644 drivers/power/ab8500_bmdata.c
create mode 100644 include/linux/power/ab8500.h
--
1.8.0
---------- Forwarded message ----------
From: <linaro-kernel-bounces(a)lists.linaro.org>
Date: 18 March 2013 22:31
Subject: Auto-discard notification
To: linaro-kernel-owner(a)lists.linaro.org
The attached message has been automatically discarded by lists.linaro.org
and hence I am forwarding it again.
--
viresh
---------- Forwarded message ----------
From: Bruce Dawson <bruced(a)valvesoftware.com>
To: 'Vincent Guittot' <vincent.guittot(a)linaro.org>, Viresh Kumar
<viresh.kumar(a)linaro.org>
Cc: Dave Jones <davej(a)redhat.com>, "cpufreq(a)vger.kernel.org"
<cpufreq(a)vger.kernel.org>, "Rafael J. Wysocki" <rjw(a)sisk.pl>,
"linaro-kernel(a)lists.linaro.org" <linaro-kernel(a)lists.linaro.org>
Date: Mon, 18 Mar 2013 17:01:17 +0000
Subject: RE: CPU power management bug -- CPU bound task fails to raise
CPU frequency
I guess it makes sense for the scheduler to look for the idlest CPU in the
system. That's good to know.
I had guessed that the scheduler would do something like that and that my
test would run on two cores. However, I find that bash alternates between
two cores, which seems odd. Additionally, each invocation of expr starts
on one core and moves to another, which seems odd given that each
invocation lives for a ms or less. The net effect is that six or more
different cores get involved.
Anyway, it is a pathological case, so maybe it doesn't matter, but given
the popularity of $(command) in shell scripts it may not be completely
irrelevant either.
-----Original Message-----
From: Vincent Guittot [mailto:vincent.guittot@linaro.org]
Sent: Monday, March 18, 2013 4:07 AM
To: Viresh Kumar
Cc: Bruce Dawson; Dave Jones; cpufreq(a)vger.kernel.org; Rafael J.
Wysocki; linaro-kernel(a)lists.linaro.org
Subject: Re: CPU power management bug -- CPU bound task fails to raise
CPU frequency
On 18 March 2013 06:04, Viresh Kumar <viresh.kumar(a)linaro.org> wrote:
> Let me bring in the scheduler expert from Linaro (Vincent Guittot, who
> would be available in a few hours)
>
> Vincent, please start reading from this mail:
>
> http://permalink.gmane.org/gmane.linux.kernel.cpufreq/9675
>
> Now, we want to understand how to make this task perform better, as the
> scheduler is using multiple cpus for it and hence all of them are staying
> at low freqs, as the load isn't high enough..
Hi,
Your 1st test creates a task to evaluate each expr, and the fork sequence
of the scheduler looks for the idlest CPU in the system. That explains why
your test is evenly spread over all CPUs and the average load of each CPU
stays below the cpufreq threshold. In contrast, your 2nd test uses only
one task, which stays on one CPU and triggers the frequency increase.
I would say that the scheduler behavior is almost normal: it spreads the
load to get the best performance (even if in this use case the threads run
sequentially), but you get this side effect on cpufreq, which sees each
core individually as not loaded. This example tends to argue for better
cooperation between the scheduler and cpufreq for sharing statistics.
Vincent
>
> --
> viresh
>
> On 18 March 2013 10:28, Bruce Dawson <bruced(a)valvesoftware.com> wrote:
>> This is with the Ondemand governor.
>>
>> The more I ponder this the more I think that the real issue is not the frequency drivers, but the kernel scheduler. The shell script involves two processes being alive at any given time, and one process running (since bash always waits for expr to finish). Therefore the entire task should run on either one core or on two. Instead, I see (from looking at thread scheduling graphed using the Zoom profiler -- http://www.rotateright.com/) that it runs on six different cores. bash alternates between two cores, and each invocation of expr is started on one core and then moves to another. Given that, it seems not surprising that the CPU frequency management doesn't trigger.
>>
>>> So, the frequency might not be increased if there are multiple cpus
>>> running for a specific task and none of them has high enough load at
>>> that time
>>
>> Yep, that's what I figured. Each cpu's load is quite low -- 20% or lower -- because the work is so spread out.
>>
>> If I run the entire thing under "taskset 1" then everything runs on one core, the frequency elevation happens, and the entire task runs roughly three times faster.
>>
>> Crazy/cool.
>>
>> -----Original Message-----
>> From: viresh.linux(a)gmail.com [mailto:viresh.linux@gmail.com] On
>> Behalf Of Viresh Kumar
>> Sent: Sunday, March 17, 2013 9:32 PM
>> To: Bruce Dawson
>> Cc: Dave Jones; cpufreq(a)vger.kernel.org; Rafael J. Wysocki;
>> linaro-kernel(a)lists.linaro.org
>> Subject: Re: CPU power management bug -- CPU bound task fails to
>> raise CPU frequency
>>
>> On Sun, Mar 17, 2013 at 6:37 AM, Bruce Dawson <bruced(a)valvesoftware.com> wrote:
>>> Dave/others, I've come up with a simple (and real) scenario where a CPU bound task running on Ubuntu (and presumably other Linux flavors) fails to be detected as CPU bound by the Linux kernel, meaning that the CPU continues to run at low speed, meaning that this CPU bound task takes (on my machines) about three times longer to run than it should.
>>>
>>> I found these e-mail addresses in the MAINTAINERS list under CPU FREQUENCY DRIVERS which I'm hoping is the correct area.
>>
>> Yes, cpufreq mailing list is the right list for this.
>>
>>> The basic problem is that on a multi-core system if you run a shell script that spawns lots of sub processes then the workload ends up distributed across all of the CPUs. Therefore, since none of the CPUs are particularly busy, the Linux kernel doesn't realize that a CPU bound task is running, so it leaves the CPU frequency set to low. I have confirmed the behavior in multiple ways. Specifically, I have used "iostat 1" and "mpstat -P ALL 1" to confirm that a full core's worth of CPU work is being done. mpstat also showed that the work was distributed across multiple cores. Using the zoom profiler UI for perf showed the sub processes (and bash) being spread across multiple cores, and perf stat showed that the CPU frequency was staying low even though the task was CPU bound.
>>
>> There are a few things which would be helpful to understand what's going on.
>> What governor is used in your case? Probably ondemand (my Ubuntu uses this).
>>
>> Ideally, cpu frequency is increased only if cpu load is very high (or
>> above a threshold, 95 on my Ubuntu). So, the frequency might not be
>> increased if there are multiple cpus running for a specific task and
>> none of them has a high enough load at that time.
>>
>> The other thing that I suspect here is a bug which was solved recently
>> by the patch below. If policy->cpu (that might be cpu 0 for you) is
>> sleeping, then load is never evaluated even if all other cpus are very
>> busy. If you can try the patch below, it might be helpful. BTW, you
>> might not be able to apply it easily as it has got lots of dependencies,
>> so you might need to pick all drivers/cpufreq patches from v3.9-rc1.
>>
>> commit 2abfa876f1117b0ab45f191fb1f82c41b1cbc8fe
>> Author: Rickard Andersson <rickard.andersson(a)stericsson.com>
>> Date: Thu Dec 27 14:55:38 2012 +0000
>>
>> cpufreq: handle SW coordinated CPUs
>>
>> This patch fixes a bug that occurred when we had load on a secondary CPU
>> and the primary CPU was sleeping. Only one sampling timer was spawned
>> and it was spawned as a deferred timer on the primary CPU, so when a
>> secondary CPU had a change in load this was not detected by the cpufreq
>> governor (both ondemand and conservative).
>>
>> This patch make sure that deferred timers are run on all CPUs in the
>> case of software controlled CPUs that run on the same frequency.
>>
>> Signed-off-by: Rickard Andersson <rickard.andersson(a)stericsson.com>
>> Signed-off-by: Fabio Baltieri <fabio.baltieri(a)linaro.org>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
>> ---
>> drivers/cpufreq/cpufreq_conservative.c | 3 ++-
>> drivers/cpufreq/cpufreq_governor.c | 44 ++++++++++++++++++++++++++++++++++++++------
>> drivers/cpufreq/cpufreq_governor.h | 1 +
>> drivers/cpufreq/cpufreq_ondemand.c | 3 ++-
>> 4 files changed, 43 insertions(+), 8 deletions(-)
>>
>>
>>> I have only reproed this behavior on six-core/twelve-thread systems. I would assume that at least a two-core system would be needed to repro this bug, and perhaps more. The bug will not repro if the system is not relatively idle, since a background CPU hog will force the frequency up.
>>>
>>> The repro is exquisitely simple -- ExprCount() is a simplified version of the repro (portable looping in a shell script) and BashCount() is an optimized and less portable version that runs far faster and also avoids this power management problem -- the CPU frequency is raised appropriately. Running a busy loop in another process is another way to get the frequency up and this makes ExprCount() run ~3x faster. Here is the script:
>>>
>>> --------------------------------------
>>> #!/bin/bash
>>> function ExprCount() {
>>> i=$1
>>> while [ $i -gt 0 ]; do
>>> i=$(expr $i - 1)
>>
>> I may be wrong, but one cpu is used to run this script and the other one would be used to run the expr program. So, 2 cpus should be good enough to reproduce this setup.
>>
>> BTW, I have tried your scripts and was able to reproduce the setup here
>> on a 2 cpu 4 thread system.
>>
>> --
>> viresh
=== Highlights ===
* My Linaro connect talk was covered on lwn here:
https://lwn.net/Articles/542466/
* Worked with tglx to try to sort out remaining issues on reducing
timekeeping lock hold time
* Got my CLOCK_TAI patches properly tested and ready for 3.10
* Got a bunch of community timekeeping changes queued for 3.10
* Re-reported git server errors folks in the community were seeing w/
our trees.
* Sent out weekly email Android Upstreaming status report (canned
hangout, since we all were in HK).
* Reviewed Mathieu's sysrq timeout patch.
* Pointed Axel at ioctl work being done in trinity, and he pointed out
that others are extending it for Android testing.
* Prepped and set up interviews with two Linaro candidates.
* Looked over DmitryP's netfilter patch queue, and sent back some minor
comments.
* Got the linaro.android branch updated to 3.9, using the
experimental/android-3.9 branch on AOSP. Thanks to Tixy for helping with
testing!
* Took a look at the ion driver to see what shape it was in for
upstreaming. It has some arm-specific assumptions, so it may need some
rework.
* Synced with Greg and AndrzejP on the removal of CCG from staging
(since its basically abandoned for the configfs gadget). Benoit from
Google chimed in as well here.
* Did an initial review of Serban's binder patches
=== Plans ===
* Sort out Linaro Connect expense reporting
* Another Linaro candidate interview.
* Review new volatile range patches from Minchan
* Try to finish up timekeeping lock hold time reductions
* Send pull request to tglx for my 3.10 queue
* Work on earlysuspend blog post
* Maybe send out pstore updates from Google?
=== Issues ===
* Jet lag returning from Linaro Connect was pretty bad. Think I'm
finally over it though.
On Sun, Mar 17, 2013 at 6:37 AM, Bruce Dawson <bruced(a)valvesoftware.com> wrote:
> Dave/others, I've come up with a simple (and real) scenario where a CPU bound task running on Ubuntu (and presumably other Linux flavors) fails to be detected as CPU bound by the Linux kernel, meaning that the CPU continues to run at low speed, meaning that this CPU bound task takes (on my machines) about three times longer to run than it should.
>
> I found these e-mail addresses in the MAINTAINERS list under CPU FREQUENCY DRIVERS which I'm hoping is the correct area.
Yes, cpufreq mailing list is the right list for this.
> The basic problem is that on a multi-core system if you run a shell script that spawns lots of sub processes then the workload ends up distributed across all of the CPUs. Therefore, since none of the CPUs are particularly busy, the Linux kernel doesn't realize that a CPU bound task is running, so it leaves the CPU frequency set to low. I have confirmed the behavior in multiple ways. Specifically, I have used "iostat 1" and "mpstat -P ALL 1" to confirm that a full core's worth of CPU work is being done. mpstat also showed that the work was distributed across multiple cores. Using the zoom profiler UI for perf showed the sub processes (and bash) being spread across multiple cores, and perf stat showed that the CPU frequency was staying low even though the task was CPU bound.
There are a few things which would be helpful to understand what's going on.
What governor is used in your case? Probably ondemand (my Ubuntu uses this).
Ideally, cpu frequency is increased only if cpu load is very high (or above
a threshold, 95 on my Ubuntu). So, the frequency might not be increased if
there are multiple cpus running for a specific task and none of them has a
high enough load at that time.
The other thing that I suspect here is a bug which was solved recently by
the patch below. If policy->cpu (that might be cpu 0 for you) is sleeping,
then load is never evaluated even if all other cpus are very busy. If you
can try the patch below, it might be helpful. BTW, you might not be able
to apply it easily as it has got lots of dependencies, so you might need
to pick all drivers/cpufreq patches from v3.9-rc1.
commit 2abfa876f1117b0ab45f191fb1f82c41b1cbc8fe
Author: Rickard Andersson <rickard.andersson(a)stericsson.com>
Date: Thu Dec 27 14:55:38 2012 +0000
cpufreq: handle SW coordinated CPUs
This patch fixes a bug that occurred when we had load on a secondary CPU
and the primary CPU was sleeping. Only one sampling timer was spawned
and it was spawned as a deferred timer on the primary CPU, so when a
secondary CPU had a change in load this was not detected by the cpufreq
governor (both ondemand and conservative).
This patch make sure that deferred timers are run on all CPUs in the
case of software controlled CPUs that run on the same frequency.
Signed-off-by: Rickard Andersson <rickard.andersson(a)stericsson.com>
Signed-off-by: Fabio Baltieri <fabio.baltieri(a)linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
---
drivers/cpufreq/cpufreq_conservative.c | 3 ++-
drivers/cpufreq/cpufreq_governor.c | 44 ++++++++++++++++++++++++++++++++++++++------
drivers/cpufreq/cpufreq_governor.h | 1 +
drivers/cpufreq/cpufreq_ondemand.c | 3 ++-
4 files changed, 43 insertions(+), 8 deletions(-)
> I have only reproed this behavior on six-core/twelve-thread systems. I would assume that at least a two-core system would be needed to repro this bug, and perhaps more. The bug will not repro if the system is not relatively idle, since a background CPU hog will force the frequency up.
>
> The repro is exquisitely simple -- ExprCount() is a simplified version of the repro (portable looping in a shell script) and BashCount() is an optimized and less portable version that runs far faster and also avoids this power management problem -- the CPU frequency is raised appropriately. Running a busy loop in another process is another way to get the frequency up and this makes ExprCount() run ~3x faster. Here is the script:
>
> --------------------------------------
> #!/bin/bash
> function ExprCount() {
> i=$1
> while [ $i -gt 0 ]; do
> i=$(expr $i - 1)
I may be wrong, but one cpu is used to run this script and the other one
would be used to run the expr program. So, 2 cpus should be good enough to
reproduce this setup.
BTW, I have tried your scripts and was able to reproduce the setup here on
a 2 cpu 4 thread system.
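As an aside, the structure of that workload can be mimicked in plain C to
observe the same spreading; a rough, hypothetical sketch (one short-lived
child per iteration, with the parent waiting like bash waits for expr):

	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		int i;

		for (i = 100000; i > 0; i--) {
			pid_t pid = fork();

			if (pid < 0)
				return 1;	/* fork failed */
			if (pid == 0)
				_exit(0);	/* child: almost no work, like one expr run */
			waitpid(pid, NULL, 0);	/* parent blocks, like bash does */
		}
		return 0;
	}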
--
viresh
=== David Long ===
=== Highlights ===
* Attended Connect. The keynotes were particularly good this time.
Kudos to all who arranged this and those who arranged the weather (not
that I got outside much).
* Although the uprobe code is installing and taking the breakpoint
properly, it is getting lost somewhere when it goes to execute the
probed instruction out-of-line. I've tried to isolate this a couple of
different ways, but no success yet.
=== Plans ===
* Continue isolating the xol problem and restructuring the emulation code.
=== Issues ===
-dl
Hi Guys,
Below are the hangout recordings of the Scheduler Internals sessions by
Vincent Guittot, done at LCA13.
We have another version of this recording, made with different cameras,
but its size is 30 GB and so it is hard to upload. In case you need it,
please contact me.
Day 1: http://www.youtube.com/watch?v=2yzelou80JE
Day 2: http://www.youtube.com/watch?v=fN11Lltx1nQ
Thanks to Naresh for arranging the hangouts.
--
Viresh
This is to let you know that the migration of lists.linaro.org has been
successfully completed.
As per the email I sent on Wednesday, it may take some time for the new
address of the server to be seen by your computer. You can check this by
trying to connect to the web site:
http://lists.linaro.org/
If you are able to connect and you do not get an error, this means you are
connecting to the new server and you can send email to the lists.
If you experience any problems after the weekend and you find that you
still cannot connect to the server, please reply to this email to let us
know.
Regards
Philip
IT Services Manager
Linaro