While migrating to the common clock framework (CCF), we found that the
FIMD clocks were being gated by the CCF.
If the CCF finds clocks that have not been claimed and enabled by any
driver, it disables them as unused.
Calling clk_prepare_enable() for the FIMD clocks fixes the issue.
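Note that clk_prepare_enable() can fail; the patch below adds the calls
unchecked. An error-checked variant would look like this (a sketch only,
not part of the patch; 'ret' and 'ctx' as already used in fimd_probe()):

	ret = clk_prepare_enable(ctx->lcd_clk);
	if (ret < 0)
		return ret;

	ret = clk_prepare_enable(ctx->bus_clk);
	if (ret < 0) {
		/* undo the first enable before bailing out */
		clk_disable_unprepare(ctx->lcd_clk);
		return ret;
	}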
Signed-off-by: Vikas Sajjan <vikas.sajjan(a)linaro.org>
---
drivers/gpu/drm/exynos/exynos_drm_fimd.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/exynos/exynos_drm_fimd.c b/drivers/gpu/drm/exynos/exynos_drm_fimd.c
index 9537761..d93dd8a 100644
--- a/drivers/gpu/drm/exynos/exynos_drm_fimd.c
+++ b/drivers/gpu/drm/exynos/exynos_drm_fimd.c
@@ -934,6 +934,9 @@ static int fimd_probe(struct platform_device *pdev)
return ret;
}
+ clk_prepare_enable(ctx->lcd_clk);
+ clk_prepare_enable(ctx->bus_clk);
+
ctx->vidcon0 = pdata->vidcon0;
ctx->vidcon1 = pdata->vidcon1;
ctx->default_win = pdata->default_win;
--
1.7.9.5
When a cpu goes into a deep idle state where its local timer is shut down,
it notifies the time framework to use the broadcast timer instead.
Unfortunately, the broadcast device can wake up any CPU, including an idle
one which is not concerned by the wake up at all.
This implies that, in the worst case, an idle CPU will wake up just to send
an IPI to another idle cpu.
This patchset solves the problem by setting the irq affinity of the
broadcast timer to the CPU concerned by the nearest timer event. This way,
the CPU that wakes up is guaranteed to be the one concerned by the next
event, and no other idle CPU is woken up unnecessarily.
As irq affinity is not supported by all architectures, a flag is needed to
mark the clock event devices that can handle it: CLOCK_EVT_FEAT_DYNIRQ.
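The core of the mechanism, as a simplified sketch (the real code lives in
kernel/time/tick-broadcast.c in the patches below):

	/*
	 * Sketch: route the broadcast timer irq to the CPU handling the
	 * next event, but only when the device supports dynamic affinity.
	 */
	static void tick_broadcast_set_affinity(struct clock_event_device *bc,
						const struct cpumask *cpumask)
	{
		if (!(bc->features & CLOCK_EVT_FEAT_DYNIRQ))
			return;

		if (cpumask_equal(bc->cpumask, cpumask))
			return;

		bc->cpumask = cpumask;
		irq_set_affinity(bc->irq, bc->cpumask);
	}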
Tested on a u8500 board with a test program running 'usleep 10000'
indefinitely, pinned on each CPU.
With dynamic irq affinity:
Log is 10.042298 secs long with 4190 events
cpu0/state0, 24 hits, total 2718.00us, avg 113.25us, min 0.00us, max 854.00us
cpu0/state1, 994 hits, total 9874827.00us, avg 9934.43us, min 30.00us, max 10346.00us
cpu1/state0, 73 hits, total 17001.00us, avg 232.89us, min 0.00us, max 10040.00us
cpu1/state1, 1002 hits, total 9883840.00us, avg 9864.11us, min 0.00us, max 10742.00us
cluster/state0, 0 hits, total 0.00us, avg 0.00us, min 0.00us, max 0.00us
cluster/state1, 1931 hits, total 9762328.00us, avg 5055.58us, min 30.00us, max 9308.00us
Without dynamic irq affinity:
Log is 10.036834 secs long with 6574 events
cpu0/state0, 114 hits, total 20107.00us, avg 176.38us, min 0.00us, max 7233.00us
cpu0/state1, 1951 hits, total 9833836.00us, avg 5040.41us, min 0.00us, max 9217.00us
cpu1/state0, 223 hits, total 21140.00us, avg 94.80us, min 0.00us, max 2960.00us
cpu1/state1, 997 hits, total 9879748.00us, avg 9909.48us, min 0.00us, max 10346.00us
cluster/state0, 5 hits, total 5462.00us, avg 1092.40us, min 580.00us, max 2899.00us
cluster/state1, 2298 hits, total 9740988.00us, avg 4238.90us, min 30.00us, max 9217.00us
Results for the specific test case 'usleep 10000':
* reduced the number of wake ups on the system by 40%
* reduced the number of wake ups for CPU0 by 49%
* increased the idle time for CPU0 by a factor of two
* increased package idle hits by 16% and average package idle time by 16%
Changelog:
V2:
* mentioned the CLOCK_EVT_FEAT_DYNIRQ flag name in the patch description
* added comments for CLOCK_EVT_FEAT_DYNIRQ
* replaced the tick_broadcast_set_affinity parameter to use a cpumask
V1: initial post
Daniel Lezcano (3):
time : pass broadcast parameter
time : set broadcast irq affinity
ARM: nomadik: add dynamic irq flag to the timer
Viresh Kumar (1):
ARM: timer-sp: Set dynamic irq affinity
arch/arm/common/timer-sp.c | 3 ++-
drivers/clocksource/nomadik-mtu.c | 3 ++-
include/linux/clockchips.h | 5 +++++
kernel/time/tick-broadcast.c | 41 +++++++++++++++++++++++++++++--------
4 files changed, 42 insertions(+), 10 deletions(-)
--
1.7.9.5
On my SMP platform, which is made of 5 cores in 2 clusters, the
nr_busy_cpus field of the sched_group_power struct is not null when the
platform is fully idle. The root cause is:
during the boot sequence, some CPUs reach the idle loop and set their
NOHZ_IDLE flag while waiting for the other CPUs to boot. But the
nr_busy_cpus field is initialized later, with the assumption that all CPUs
are in the busy state, whereas some CPUs have already set their NOHZ_IDLE
flag.
More generally, the NOHZ_IDLE flag must be initialized when new
sched_domains are created, in order to ensure that NOHZ_IDLE and
nr_busy_cpus stay aligned.
This condition could be ensured by adding a synchronize_rcu() between the
destruction of the old sched_domains and the creation of the new ones, so
that the NOHZ_IDLE flag would not be updated against an old sched_domain
once it has been initialized. But this solution introduces additional
latency in the rebuild sequence that is called during cpu hotplug.
As suggested by Frederic Weisbecker, another solution is to have the same
RCU lifecycle for both the NOHZ_IDLE flag and the sched_domain struct. I
have introduced a new sched_domain_rq struct that is the entry point for
both the sched_domains and the objects that must follow the same
lifecycle, such as the NOHZ_IDLE flag. They share the same RCU lifecycle
and are therefore always synchronized.
The synchronization is done at the cost of:
- an additional indirection for accessing the first sched_domain level
- an additional indirection and an rcu_dereference before accessing the
  NOHZ_IDLE flag.
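The underlying idiom is the classic RCU publish/reclaim pattern; a generic
sketch (names here are illustrative, not the scheduler's):

	/* Group the objects in one RCU-protected container so that they
	 * are published and freed atomically. */
	struct container {
		struct payload *obj;	/* e.g. the sched_domain pointer */
		unsigned long flags;	/* e.g. the NOHZ_IDLE bit */
		struct rcu_head rcu;	/* for deferred freeing */
	};

	/* Reader: one rcu_dereference() gives a consistent view of both fields. */
	rcu_read_lock();
	c = rcu_dereference(shared);
	if (c && test_bit(MY_FLAG, &c->flags))
		use(c->obj);
	rcu_read_unlock();

	/* Updater: publish the new container, free the old one after a grace period. */
	old = shared;
	rcu_assign_pointer(shared, new_c);
	kfree_rcu(old, rcu);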
Changes since V4:
- link both the sched_domain and the NOHZ_IDLE flag in one RCU object so
  that their states are always synchronized.
Changes since V3:
- the NOHZ flag is not cleared if a NULL domain is attached to the CPU
- remove patch 2/2, which becomes useless with the latest modifications
Changes since V2:
- change the initialization to the idle state instead of the busy state so
  that a CPU entering idle during the build of the sched_domain will not
  corrupt the initialization state
Changes since V1:
- remove the patch for the SCHED softirq on an idle core use case, as it
  was a side effect of the other use cases.
Signed-off-by: Vincent Guittot <vincent.guittot(a)linaro.org>
---
include/linux/sched.h | 6 +++
kernel/sched/core.c | 105 ++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/fair.c | 35 +++++++++++------
kernel/sched/sched.h | 24 +++++++++--
4 files changed, 145 insertions(+), 25 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..2a52188 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -959,6 +959,12 @@ struct sched_domain {
unsigned long span[0];
};
+struct sched_domain_rq {
+ struct sched_domain *sd;
+ unsigned long flags;
+ struct rcu_head rcu; /* used during destruction */
+};
+
static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
{
return to_cpumask(sd->span);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f12624..69e2313 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5602,6 +5602,15 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
destroy_sched_domain(sd, cpu);
}
+static void destroy_sched_domain_rq(struct sched_domain_rq *sd_rq, int cpu)
+{
+ if (!sd_rq)
+ return;
+
+ destroy_sched_domains(sd_rq->sd, cpu);
+ kfree_rcu(sd_rq, rcu);
+}
+
/*
* Keep a special pointer to the highest sched_domain that has
* SD_SHARE_PKG_RESOURCE set (Last Level Cache Domain) for this
@@ -5632,10 +5641,23 @@ static void update_top_cache_domain(int cpu)
* hold the hotplug lock.
*/
static void
-cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
+cpu_attach_domain(struct sched_domain_rq *sd_rq, struct root_domain *rd,
+ int cpu)
{
struct rq *rq = cpu_rq(cpu);
- struct sched_domain *tmp;
+ struct sched_domain_rq *tmp_rq;
+ struct sched_domain *tmp, *sd = NULL;
+
+ /*
+ * If we don't have any sched_domain and associated object, we can
+ * directly jump to the attach sequence otherwise we try to degenerate
+ * the sched_domain
+ */
+ if (!sd_rq)
+ goto attach;
+
+ /* Get a pointer to the 1st sched_domain */
+ sd = sd_rq->sd;
/* Remove the sched domains which do not contribute to scheduling. */
for (tmp = sd; tmp; ) {
@@ -5658,14 +5680,17 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
destroy_sched_domain(tmp, cpu);
if (sd)
sd->child = NULL;
+ /* update sched_domain_rq */
+ sd_rq->sd = sd;
}
+attach:
sched_domain_debug(sd, cpu);
rq_attach_root(rq, rd);
- tmp = rq->sd;
- rcu_assign_pointer(rq->sd, sd);
- destroy_sched_domains(tmp, cpu);
+ tmp_rq = rq->sd_rq;
+ rcu_assign_pointer(rq->sd_rq, sd_rq);
+ destroy_sched_domain_rq(tmp_rq, cpu);
update_top_cache_domain(cpu);
}
@@ -5695,12 +5720,14 @@ struct sd_data {
};
struct s_data {
+ struct sched_domain_rq ** __percpu sd_rq;
struct sched_domain ** __percpu sd;
struct root_domain *rd;
};
enum s_alloc {
sa_rootdomain,
+ sa_sd_rq,
sa_sd,
sa_sd_storage,
sa_none,
@@ -5935,7 +5962,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
return;
update_group_power(sd, cpu);
- atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
+ atomic_set(&sg->sgp->nr_busy_cpus, 0);
}
int __weak arch_sd_sibling_asym_packing(void)
@@ -6011,6 +6038,8 @@ static void set_domain_attribute(struct sched_domain *sd,
static void __sdt_free(const struct cpumask *cpu_map);
static int __sdt_alloc(const struct cpumask *cpu_map);
+static void __sdrq_free(const struct cpumask *cpu_map, struct s_data *d);
+static int __sdrq_alloc(const struct cpumask *cpu_map, struct s_data *d);
static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
const struct cpumask *cpu_map)
@@ -6019,6 +6048,9 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
case sa_rootdomain:
if (!atomic_read(&d->rd->refcount))
free_rootdomain(&d->rd->rcu); /* fall through */
+ case sa_sd_rq:
+ __sdrq_free(cpu_map, d); /* fall through */
+ free_percpu(d->sd_rq); /* fall through */
case sa_sd:
free_percpu(d->sd); /* fall through */
case sa_sd_storage:
@@ -6038,9 +6070,14 @@ static enum s_alloc __visit_domain_allocation_hell(struct s_data *d,
d->sd = alloc_percpu(struct sched_domain *);
if (!d->sd)
return sa_sd_storage;
+ d->sd_rq = alloc_percpu(struct sched_domain_rq *);
+ if (!d->sd_rq)
+ return sa_sd;
+ if (__sdrq_alloc(cpu_map, d))
+ return sa_sd_rq;
d->rd = alloc_rootdomain();
if (!d->rd)
- return sa_sd;
+ return sa_sd_rq;
return sa_rootdomain;
}
@@ -6466,6 +6503,46 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
}
+static int __sdrq_alloc(const struct cpumask *cpu_map, struct s_data *d)
+{
+ int j;
+
+ for_each_cpu(j, cpu_map) {
+ struct sched_domain_rq *sd_rq;
+
+ sd_rq = kzalloc_node(sizeof(struct sched_domain_rq),
+ GFP_KERNEL, cpu_to_node(j));
+ if (!sd_rq)
+ return -ENOMEM;
+
+ *per_cpu_ptr(d->sd_rq, j) = sd_rq;
+ }
+
+ return 0;
+}
+
+static void __sdrq_free(const struct cpumask *cpu_map, struct s_data *d)
+{
+ int j;
+
+ for_each_cpu(j, cpu_map)
+ if (*per_cpu_ptr(d->sd_rq, j))
+ kfree(*per_cpu_ptr(d->sd_rq, j));
+}
+
+static void build_sched_domain_rq(struct s_data *d, int cpu)
+{
+ struct sched_domain_rq *sd_rq;
+ struct sched_domain *sd;
+
+ /* Attach sched_domain to sched_domain_rq */
+ sd = *per_cpu_ptr(d->sd, cpu);
+ sd_rq = *per_cpu_ptr(d->sd_rq, cpu);
+ sd_rq->sd = sd;
+ /* Init flags */
+ set_bit(NOHZ_IDLE, sched_rq_flags(sd_rq));
+}
+
struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
struct s_data *d, const struct cpumask *cpu_map,
struct sched_domain_attr *attr, struct sched_domain *child,
@@ -6495,6 +6572,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
struct sched_domain_attr *attr)
{
enum s_alloc alloc_state = sa_none;
+ struct sched_domain_rq *sd_rq;
struct sched_domain *sd;
struct s_data d;
int i, ret = -ENOMEM;
@@ -6547,11 +6625,18 @@ static int build_sched_domains(const struct cpumask *cpu_map,
}
}
+ /* Init objects that must follow the sched_domain lifecycle */
+ for_each_cpu(i, cpu_map) {
+ build_sched_domain_rq(&d, i);
+ }
+
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
- sd = *per_cpu_ptr(d.sd, i);
- cpu_attach_domain(sd, d.rd, i);
+ sd_rq = *per_cpu_ptr(d.sd_rq, i);
+ cpu_attach_domain(sd_rq, d.rd, i);
+ /* claim allocation of sched_domain_rq object */
+ *per_cpu_ptr(d.sd_rq, i) = NULL;
}
rcu_read_unlock();
@@ -6982,7 +7067,7 @@ void __init sched_init(void)
rq->last_load_update_tick = jiffies;
#ifdef CONFIG_SMP
- rq->sd = NULL;
+ rq->sd_rq = NULL;
rq->rd = NULL;
rq->cpu_power = SCHED_POWER_SCALE;
rq->post_schedule = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..1c7447e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5392,31 +5392,39 @@ static inline void nohz_balance_exit_idle(int cpu)
static inline void set_cpu_sd_state_busy(void)
{
+ struct sched_domain_rq *sd_rq;
struct sched_domain *sd;
int cpu = smp_processor_id();
- if (!test_bit(NOHZ_IDLE, nohz_flags(cpu)))
- return;
- clear_bit(NOHZ_IDLE, nohz_flags(cpu));
-
rcu_read_lock();
- for_each_domain(cpu, sd)
+ sd_rq = get_sched_domain_rq(cpu);
+
+ if (!sd_rq || !test_bit(NOHZ_IDLE, sched_rq_flags(sd_rq)))
+ goto unlock;
+ clear_bit(NOHZ_IDLE, sched_rq_flags(sd_rq));
+
+ for_each_domain_from_rq(sd_rq, sd)
atomic_inc(&sd->groups->sgp->nr_busy_cpus);
+unlock:
rcu_read_unlock();
}
void set_cpu_sd_state_idle(void)
{
+ struct sched_domain_rq *sd_rq;
struct sched_domain *sd;
int cpu = smp_processor_id();
- if (test_bit(NOHZ_IDLE, nohz_flags(cpu)))
- return;
- set_bit(NOHZ_IDLE, nohz_flags(cpu));
-
rcu_read_lock();
- for_each_domain(cpu, sd)
+ sd_rq = get_sched_domain_rq(cpu);
+
+ if (!sd_rq || test_bit(NOHZ_IDLE, sched_rq_flags(sd_rq)))
+ goto unlock;
+ set_bit(NOHZ_IDLE, sched_rq_flags(sd_rq));
+
+ for_each_domain_from_rq(sd_rq, sd)
atomic_dec(&sd->groups->sgp->nr_busy_cpus);
+unlock:
rcu_read_unlock();
}
@@ -5673,7 +5681,12 @@ static void run_rebalance_domains(struct softirq_action *h)
static inline int on_null_domain(int cpu)
{
- return !rcu_dereference_sched(cpu_rq(cpu)->sd);
+ struct sched_domain_rq *sd_rq =
+ rcu_dereference_sched(cpu_rq(cpu)->sd_rq);
+ struct sched_domain *sd = NULL;
+ if (sd_rq)
+ sd = sd_rq->sd;
+ return !sd;
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..f589306 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -417,7 +417,7 @@ struct rq {
#ifdef CONFIG_SMP
struct root_domain *rd;
- struct sched_domain *sd;
+ struct sched_domain_rq *sd_rq;
unsigned long cpu_power;
@@ -505,21 +505,37 @@ DECLARE_PER_CPU(struct rq, runqueues);
#ifdef CONFIG_SMP
-#define rcu_dereference_check_sched_domain(p) \
+#define rcu_dereference_check_sched_domain_rq(p) \
rcu_dereference_check((p), \
lockdep_is_held(&sched_domains_mutex))
+#define get_sched_domain_rq(cpu) \
+ rcu_dereference_check_sched_domain_rq(cpu_rq(cpu)->sd_rq)
+
+#define rcu_dereference_check_sched_domain(cpu) ({ \
+ struct sched_domain_rq *__sd_rq = get_sched_domain_rq(cpu); \
+ struct sched_domain *__sd = NULL; \
+ if (__sd_rq) \
+ __sd = __sd_rq->sd; \
+ __sd; \
+})
+
+#define sched_rq_flags(sd_rq) (&sd_rq->flags)
+
/*
- * The domain tree (rq->sd) is protected by RCU's quiescent state transition.
+ * The domain tree (rq->sd_rq) is protected by RCU's quiescent state transition.
* See detach_destroy_domains: synchronize_sched for details.
*
* The domain tree of any CPU may only be accessed from within
* preempt-disabled sections.
*/
#define for_each_domain(cpu, __sd) \
- for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
+ for (__sd = rcu_dereference_check_sched_domain(cpu); \
__sd; __sd = __sd->parent)
+#define for_each_domain_from_rq(sd_rq, __sd) \
+ for (__sd = sd_rq->sd; __sd; __sd = __sd->parent)
+
#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
/**
--
1.7.9.5
Guenter,
Please check these v4 patches; thanks for all your reviews and comments on
this patch set.
Anton Vorontsov,
I have added your Acked-by to patch [1/3]; but for this v4 [2/3], we had
to change it substantially according to Guenter's feedback and our
internal discussions, so please have a look at this new [2/3] again,
thank you.
v3 -> v4 changes:
for patch [3/3]
- define delays in HZ
- update the ab8500_read_sensor function, returning the temperature via a
  parameter
- remove the ab8500_is_visible function
- use clamp_val in the set_min and set_max callbacks (see the sketch below
  the changelog)
- remove unnecessary locks in the remove and suspend functions
- let abx500 and ab8500 each use their own data structure
for patch [2/3]
- move the data tables from drivers/power/ab8500_bmdata.c to
  include/linux/power/ab8500.h
- rename drivers/power/ab8500_bmdata.c to drivers/power/ab8500_bm.c
- rename these variables to eliminate CamelCase warnings
- add the const attribute to these data tables
v2 -> v3 changes:
- Add an interface for converting voltage to temperature
- Remove the temp5 sensor since we cannot offer a temperature read
  interface for it
- Update hyst to use an absolute temperature instead of a difference
- Add the 3/3 patch
v1 -> v2 changes:
- Add Documentation/hwmon/ab8500 and Documentation/hwmon/abx500
- Make devices which cannot report milli-Celsius invisible
- Add the temp5_crit interface
- Re-work the old find_active_thresholds() into threshold_updated()
- Reset updated_min_alarm and updated_max_alarm at the end of each loop
- Update the hyst mechanism to make it work as a real hysteresis
- Remove non-standard attributes
- Re-order the operation sequence inside the probe and remove functions
- Update all the lock usages to eliminate race conditions
- Make attribute indexes start from 0
also changes:
- Since the old [1/2] "ARM: ux500: rename ab8500 to abx500 for hwmon
  driver" has been merged by Samuel, it won't be sent again.
- Add another new patch "ab8500_btemp: export two symbols" as [2/2] of
  this patch set.
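For reference, the clamp_val() change called out in the v4 notes follows
the usual hwmon store-callback pattern; a hypothetical sketch (the bounds
and the elided driver plumbing are made up, not the actual driver code):

	static ssize_t set_min(struct device *dev, struct device_attribute *attr,
			       const char *buf, size_t count)
	{
		long val;
		int res = kstrtol(buf, 10, &val);

		if (res < 0)
			return res;

		/* Clamp user input to the sensor's supported range (example bounds). */
		val = clamp_val(val, 0, 150);

		/* ...store val in the driver data, under the driver's lock... */
		return count;
	}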
Hongbo Zhang (3):
ab8500_btemp: make ab8500_btemp_get* interfaces public
ab8500: re-arrange ab8500 power and temperature data tables
hwmon: add ST-Ericsson ABX500 hwmon driver
Documentation/hwmon/ab8500 | 22 ++
Documentation/hwmon/abx500 | 28 ++
drivers/hwmon/Kconfig | 13 +
drivers/hwmon/Makefile | 1 +
drivers/hwmon/ab8500.c | 208 ++++++++++++++
drivers/hwmon/abx500.c | 494 +++++++++++++++++++++++++++++++++
drivers/hwmon/abx500.h | 69 +++++
drivers/power/Makefile | 2 +-
drivers/power/ab8500_bm.c | 341 +++++++++++++++++++++++
drivers/power/ab8500_bmdata.c | 519 -----------------------------------
drivers/power/ab8500_btemp.c | 5 +-
drivers/power/ab8500_fg.c | 4 +-
include/linux/mfd/abx500.h | 6 +-
include/linux/mfd/abx500/ab8500-bm.h | 5 +
include/linux/power/ab8500.h | 189 +++++++++++++
15 files changed, 1380 insertions(+), 526 deletions(-)
create mode 100644 Documentation/hwmon/ab8500
create mode 100644 Documentation/hwmon/abx500
create mode 100644 drivers/hwmon/ab8500.c
create mode 100644 drivers/hwmon/abx500.c
create mode 100644 drivers/hwmon/abx500.h
create mode 100644 drivers/power/ab8500_bm.c
delete mode 100644 drivers/power/ab8500_bmdata.c
create mode 100644 include/linux/power/ab8500.h
--
1.8.0
---------- Forwarded message ----------
From: <linaro-kernel-bounces(a)lists.linaro.org>
Date: 18 March 2013 22:31
Subject: Auto-discard notification
To: linaro-kernel-owner(a)lists.linaro.org
The attached message has been automatically discarded by lists.linaro.org
and hence I am forwarding it again.
--
viresh
---------- Forwarded message ----------
From: Bruce Dawson <bruced(a)valvesoftware.com>
To: 'Vincent Guittot' <vincent.guittot(a)linaro.org>, Viresh Kumar
<viresh.kumar(a)linaro.org>
Cc: Dave Jones <davej(a)redhat.com>, "cpufreq(a)vger.kernel.org"
<cpufreq(a)vger.kernel.org>, "Rafael J. Wysocki" <rjw(a)sisk.pl>,
"linaro-kernel(a)lists.linaro.org" <linaro-kernel(a)lists.linaro.org>
Date: Mon, 18 Mar 2013 17:01:17 +0000
Subject: RE: CPU power management bug -- CPU bound task fails to raise
CPU frequency
I guess it makes sense for the scheduler to look for the idlest CPU in the
system. That's good to know.
I had guessed that the scheduler would do something like that and that my
test would run on two cores. However, I find that bash alternates between
two cores, which seems odd. Additionally, each invocation of expr starts
on one core and moves to another, which seems odd given that each
invocation lives for a ms or less. The net effect is that six or more
different cores get involved.
Anyway, it is a pathological case, so maybe it doesn't matter, but given
the popularity of $(command) in shell scripts it may not be completely
irrelevant either.
-----Original Message-----
From: Vincent Guittot [mailto:vincent.guittot@linaro.org]
Sent: Monday, March 18, 2013 4:07 AM
To: Viresh Kumar
Cc: Bruce Dawson; Dave Jones; cpufreq(a)vger.kernel.org; Rafael J.
Wysocki; linaro-kernel(a)lists.linaro.org
Subject: Re: CPU power management bug -- CPU bound task fails to raise
CPU frequency
On 18 March 2013 06:04, Viresh Kumar <viresh.kumar(a)linaro.org> wrote:
> Let me bring in the scheduler expert from Linaro (Vincent Guittot, who
> would be available in a few hours)
>
> Vincent, please start reading from this mail:
>
> http://permalink.gmane.org/gmane.linux.kernel.cpufreq/9675
>
> Now, we want to understand how to make this task perform better, as the
> scheduler is using multiple cpus for it and hence all of them are staying
> at low freqs, as the load isn't high enough..
Hi,
Your 1st test creates a task to evaluate each expr, and the fork sequence
of the scheduler looks for the idlest CPU in the system. That explains why
your test is evenly spread over all CPUs and the average load of each CPU
stays below the cpufreq threshold. In contrast, your 2nd test uses only
one task, which stays on one CPU and triggers the frequency increase.
I would say that the scheduler behavior is almost normal: it spreads the
load to get the best performance (even if in this use case the threads run
sequentially), but you get this side effect on cpufreq, which sees each
core individually as not loaded. This example tends to argue for better
cooperation between the scheduler and cpufreq for sharing statistics.
Vincent
>
> --
> viresh
>
> On 18 March 2013 10:28, Bruce Dawson <bruced(a)valvesoftware.com> wrote:
>> This is with the Ondemand governor.
>>
>> The more I ponder this the more I think that the real issue is not the frequency drivers, but the kernel scheduler. The shell script involves two processes being alive at any given time, and one process running (since bash always waits for expr to finish). Therefore the entire task should run on either one core or on two. Instead, I see (from looking at thread scheduling graphed using the Zoom profiler -- http://www.rotateright.com/) that it runs on six different cores. bash alternates between two cores, and each invocation of expr is started on one core and then moves to another. Given that, it seems not surprising that the CPU frequency management doesn't trigger.
>>
>>> So, the frequency might not be increased if there are multiple cpus
>>> running for a specific task and none of them has high enough load at
>>> that time
>>
>> Yep, that's what I figured. Each cpu's load is quite low -- 20% or lower -- because the work is so spread out.
>>
>> If I run the entire thing under "taskset 1" then everything runs on one core, the frequency elevation happens, and the entire task runs roughly three times faster.
>>
>> Crazy/cool.
>>
>> -----Original Message-----
>> From: viresh.linux(a)gmail.com [mailto:viresh.linux@gmail.com] On
>> Behalf Of Viresh Kumar
>> Sent: Sunday, March 17, 2013 9:32 PM
>> To: Bruce Dawson
>> Cc: Dave Jones; cpufreq(a)vger.kernel.org; Rafael J. Wysocki;
>> linaro-kernel(a)lists.linaro.org
>> Subject: Re: CPU power management bug -- CPU bound task fails to
>> raise CPU frequency
>>
>> On Sun, Mar 17, 2013 at 6:37 AM, Bruce Dawson <bruced(a)valvesoftware.com> wrote:
>>> Dave/others, I've come up with a simple (and real) scenario where a CPU bound task running on Ubuntu (and presumably other Linux flavors) fails to be detected as CPU bound by the Linux kernel, meaning that the CPU continues to run at low speed, meaning that this CPU bound task takes (on my machines) about three times longer to run than it should.
>>>
>>> I found these e-mail addresses in the MAINTAINERS list under CPU FREQUENCY DRIVERS which I'm hoping is the correct area.
>>
>> Yes, cpufreq mailing list is the right list for this.
>>
>>> The basic problem is that on a multi-core system if you run a shell script that spawns lots of sub processes then the workload ends up distributed across all of the CPUs. Therefore, since none of the CPUs are particularly busy, the Linux kernel doesn't realize that a CPU bound task is running, so it leaves the CPU frequency set to low. I have confirmed the behavior in multiple ways. Specifically, I have used "iostat 1" and "mpstat -P ALL 1" to confirm that a full core's worth of CPU work is being done. mpstat also showed that the work was distributed across multiple cores. Using the zoom profiler UI for perf showed the sub processes (and bash) being spread across multiple cores, and perf stat showed that the CPU frequency was staying low even though the task was CPU bound.
>>
>> There are a few things which would be helpful to understand what's going on.
>> What governor is used in your case? Probably ondemand (my Ubuntu uses this).
>>
>> Ideally, cpu frequency is increased only if cpu load is very high (or
>> above a threshold, 95 on my Ubuntu). So, the frequency might not be
>> increased if there are multiple cpus running for a specific task and
>> none of them has a high enough load at that time.
>>
>> The other thing that I suspect here is a bug which was solved recently
>> by the patch below. If policy->cpu (that might be cpu 0 for you) is
>> sleeping, then load is never evaluated even if all other cpus are very
>> busy. If you can try the patch below, it might be helpful. BTW, you
>> might not be able to apply it easily as it has got lots of dependencies,
>> so you might need to pick all drivers/cpufreq patches from v3.9-rc1.
>>
>> commit 2abfa876f1117b0ab45f191fb1f82c41b1cbc8fe
>> Author: Rickard Andersson <rickard.andersson(a)stericsson.com>
>> Date: Thu Dec 27 14:55:38 2012 +0000
>>
>> cpufreq: handle SW coordinated CPUs
>>
>> This patch fixes a bug that occurred when we had load on a secondary CPU
>> and the primary CPU was sleeping. Only one sampling timer was spawned
>> and it was spawned as a deferred timer on the primary CPU, so when a
>> secondary CPU had a change in load this was not detected by the cpufreq
>> governor (both ondemand and conservative).
>>
>> This patch make sure that deferred timers are run on all CPUs in the
>> case of software controlled CPUs that run on the same frequency.
>>
>> Signed-off-by: Rickard Andersson <rickard.andersson(a)stericsson.com>
>> Signed-off-by: Fabio Baltieri <fabio.baltieri(a)linaro.org>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
>> ---
>> drivers/cpufreq/cpufreq_conservative.c | 3 ++-
>> drivers/cpufreq/cpufreq_governor.c | 44 ++++++++++++++++++++++++++++++++++++++------
>> drivers/cpufreq/cpufreq_governor.h | 1 +
>> drivers/cpufreq/cpufreq_ondemand.c | 3 ++-
>> 4 files changed, 43 insertions(+), 8 deletions(-)
>>
>>
>>> I have only reproed this behavior on six-core/twelve-thread systems. I would assume that at least a two-core system would be needed to repro this bug, and perhaps more. The bug will not repro if the system is not relatively idle, since a background CPU hog will force the frequency up.
>>>
>>> The repro is exquisitely simple -- ExprCount() is a simplified version of the repro (portable looping in a shell script) and BashCount() is an optimized and less portable version that runs far faster and also avoids this power management problem -- the CPU frequency is raised appropriately. Running a busy loop in another process is another way to get the frequency up and this makes ExprCount() run ~3x faster. Here is the script:
>>>
>>> --------------------------------------
>>> #!/bin/bash
>>> function ExprCount() {
>>> i=$1
>>> while [ $i -gt 0 ]; do
>>> i=$(expr $i - 1)
>>
>> I may be wrong, but one cpu is used to run this script and the other one would be used to run the expr program. So, 2 cpus should be good enough to reproduce this setup.
>>
>> BTW, I have tried your scripts and was able to reproduce the setup here
>> on a 2 cpu 4 thread system.
>>
>> --
>> viresh
=== Highlights ===
* My Linaro connect talk was covered on lwn here:
https://lwn.net/Articles/542466/
* Worked with tglx to try to sort out remaining issues on reducing
timekeeping lock hold time
* Got my CLOCK_TAI patches properly tested and ready for 3.10
* Got a bunch of community timekeeping changes queued for 3.10
* Re-reported git server errors folks in the community were seeing w/
our trees.
* Sent out weekly email Android Upstreaming status report (canned
hangout, since we all were in HK).
* Reviewed Mathieu's sysrq timeout patch.
* Pointed Axel at ioctl work being done in trinity, and he pointed out
that others are extending it for Android testing.
* Prepped and set up interviews with two Linaro candidates.
* Looked over DmitryP's netfilter patch queue, and sent back some minor
comments.
* Got the linaro.android branch updated to 3.9, using the
experimental/android-3.9 branch on AOSP. Thanks to Tixy for helping with
testing!
* Took a look at the ion driver to see what shape it was in for
upstreaming. It has some arm-specific assumptions, so it may need some
rework.
* Synced with Greg and AndrzejP on the removal of CCG from staging
(since its basically abandoned for the configfs gadget). Benoit from
Google chimed in as well here.
* Did an initial review of Serban's binder patches
=== Plans ===
* Sort out Linaro Connect expense reporting
* Another Linaro candidate interview.
* Review new volatile range patches from Minchan
* Try to finish up timekeeping lock hold time reductions
* Send pull request to tglx for my 3.10 queue
* Work on earlysuspend blog post
* Maybe send out pstore updates from Google?
=== Issues ===
* Jet lag returning from Linaro Connect was pretty bad. Think I'm
finally over it though.
On Sun, Mar 17, 2013 at 6:37 AM, Bruce Dawson <bruced(a)valvesoftware.com> wrote:
> Dave/others, I've come up with a simple (and real) scenario where a CPU bound task running on Ubuntu (and presumably other Linux flavors) fails to be detected as CPU bound by the Linux kernel, meaning that the CPU continues to run at low speed, meaning that this CPU bound task takes (on my machines) about three times longer to run than it should.
>
> I found these e-mail addresses in the MAINTAINERS list under CPU FREQUENCY DRIVERS which I'm hoping is the correct area.
Yes, cpufreq mailing list is the right list for this.
> The basic problem is that on a multi-core system if you run a shell script that spawns lots of sub processes then the workload ends up distributed across all of the CPUs. Therefore, since none of the CPUs are particularly busy, the Linux kernel doesn't realize that a CPU bound task is running, so it leaves the CPU frequency set to low. I have confirmed the behavior in multiple ways. Specifically, I have used "iostat 1" and "mpstat -P ALL 1" to confirm that a full core's worth of CPU work is being done. mpstat also showed that the work was distributed across multiple cores. Using the zoom profiler UI for perf showed the sub processes (and bash) being spread across multiple cores, and perf stat showed that the CPU frequency was staying low even though the task was CPU bound.
There are a few things which would be helpful to understand what's going on.
What governor is used in your case? Probably ondemand (my Ubuntu uses this).
Ideally, cpu frequency is increased only if cpu load is very high (or above
a threshold, 95 on my Ubuntu). So, the frequency might not be increased if
there are multiple cpus running for a specific task and none of them has a
high enough load at that time.
The other thing that I suspect here is a bug which was solved recently by
the patch below. If policy->cpu (that might be cpu 0 for you) is sleeping,
then load is never evaluated even if all other cpus are very busy. If you
can try the patch below, it might be helpful. BTW, you might not be able
to apply it easily as it has got lots of dependencies, so you might need
to pick all drivers/cpufreq patches from v3.9-rc1.
commit 2abfa876f1117b0ab45f191fb1f82c41b1cbc8fe
Author: Rickard Andersson <rickard.andersson(a)stericsson.com>
Date: Thu Dec 27 14:55:38 2012 +0000
cpufreq: handle SW coordinated CPUs
This patch fixes a bug that occurred when we had load on a secondary CPU
and the primary CPU was sleeping. Only one sampling timer was spawned
and it was spawned as a deferred timer on the primary CPU, so when a
secondary CPU had a change in load this was not detected by the cpufreq
governor (both ondemand and conservative).
This patch make sure that deferred timers are run on all CPUs in the
case of software controlled CPUs that run on the same frequency.
Signed-off-by: Rickard Andersson <rickard.andersson(a)stericsson.com>
Signed-off-by: Fabio Baltieri <fabio.baltieri(a)linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
---
drivers/cpufreq/cpufreq_conservative.c | 3 ++-
drivers/cpufreq/cpufreq_governor.c | 44 ++++++++++++++++++++++++++++++++++++++------
drivers/cpufreq/cpufreq_governor.h | 1 +
drivers/cpufreq/cpufreq_ondemand.c | 3 ++-
4 files changed, 43 insertions(+), 8 deletions(-)
> I have only reproed this behavior on six-core/twelve-thread systems. I would assume that at least a two-core system would be needed to repro this bug, and perhaps more. The bug will not repro if the system is not relatively idle, since a background CPU hog will force the frequency up.
>
> The repro is exquisitely simple -- ExprCount() is a simplified version of the repro (portable looping in a shell script) and BashCount() is an optimized and less portable version that runs far faster and also avoids this power management problem -- the CPU frequency is raised appropriately. Running a busy loop in another process is another way to get the frequency up and this makes ExprCount() run ~3x faster. Here is the script:
>
> --------------------------------------
> #!/bin/bash
> function ExprCount() {
> i=$1
> while [ $i -gt 0 ]; do
> i=$(expr $i - 1)
I may be wrong, but one cpu is used to run this script and the other one
would be used to run the expr program. So, 2 cpus should be good enough to
reproduce this setup.
BTW, I have tried your scripts and was able to reproduce the setup here on
a 2 cpu 4 thread system.
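As an aside, the structure of that workload can be mimicked in plain C to
observe the same spreading; a rough, hypothetical sketch (one short-lived
child per iteration, with the parent waiting like bash waits for expr):

	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		int i;

		for (i = 100000; i > 0; i--) {
			pid_t pid = fork();

			if (pid < 0)
				return 1;	/* fork failed */
			if (pid == 0)
				_exit(0);	/* child: almost no work, like one expr run */
			waitpid(pid, NULL, 0);	/* parent blocks, like bash does */
		}
		return 0;
	}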
--
viresh
=== David Long ===
=== Highlights ===
* Attended Connect. The keynotes were particularly good this time.
Kudos to all who arranged this and those who arranged the weather (not
that I got outside much).
* Although the uprobe code is installing and taking the breakpoint
properly, it is getting lost somewhere when it goes to execute the
probed instruction out-of-line. I've tried to isolate this a couple of
different ways, but no success yet.
=== Plans ===
* Continue isolating the xol problem and restructuring the emulation code.
=== Issues ===
-dl
Hi Guys,
Below are the hangout recordings of the Scheduler Internals sessions by
Vincent Guittot, done at LCA13.
We have another version of this recording, made with different cameras,
but its size is 30 GB and so it is hard to upload. In case you need it,
please contact me.
Day 1: http://www.youtube.com/watch?v=2yzelou80JE
Day 2: http://www.youtube.com/watch?v=fN11Lltx1nQ
Thanks to Naresh for arranging the hangouts.
--
Viresh
This is to let you know that the migration of lists.linaro.org has been
successfully completed.
As per the email I sent on Wednesday, it may take some time for the new
address of the server to be seen by your computer. You can check this by
trying to connect to the web site:
http://lists.linaro.org/
If you are able to connect and you do not get an error, this means you are
connecting to the new server and you can send email to the lists.
If you experience any problems after the weekend and you find that you
still cannot connect to the server, please reply to this email to let us
know.
Regards
Philip
IT Services Manager
Linaro