Following the discussion started here [1], I now have a proposal for tackling
generic support for host bridges described via device tree. It is an initial
stab at it, meant to gather feedback and suggestions, but it is functional
enough that I have PCI Express for arm64 working on an FPGA, using a patch
that adds PCI support for that platform and that I am also publishing.
Looking at the existing architectures that fit the requirements (use of device
tree and PCI), powerpc and microblaze look generic enough to be candidates
for conversion. I have a tentative patch for microblaze that I can only
compile test; unfortunately, using qemu-microblaze leads to an early crash in
the kernel.
As Bjorn has mentioned in the previous discussion, the idea is to add to
struct pci_host_bridge enough data to be able to reduce the size or remove the
architecture specific pci_controller structure. arm64 support actually manages
to get rid of all the architecture static data and has no pci_controller structure
defined. For host bridge drivers that means a change of API unless architectures
decide to provide a compatibility layer (comments here please).
In order to initialise a host bridge with the new API, the following example
code is sufficient for a _probe() function:
static int myhostbridge_probe(struct platform_device *pdev)
{
	int err;
	struct device_node *dev;
	struct pci_host_bridge *bridge;
	struct resource bus_range;
	struct myhostbridge_port *pp;
	LIST_HEAD(resources);

	dev = pdev->dev.of_node;
	if (!of_device_is_available(dev)) {
		pr_warn("%s: disabled\n", dev->full_name);
		return -ENODEV;
	}

	pp = kzalloc(sizeof(struct myhostbridge_port), GFP_KERNEL);
	if (!pp)
		return -ENOMEM;

	err = of_pci_parse_bus_range(dev, &bus_range);
	if (err) {
		bus_range.start = 0;
		bus_range.end = 255;
		bus_range.flags = IORESOURCE_BUS;
	}
	pci_add_resource(&resources, &bus_range);

	bridge = pci_host_bridge_of_init(&pdev->dev, 0, &myhostbridge_ops, pp, &resources);
	if (!bridge) {
		err = -EINVAL;
		goto bridge_init_fail;
	}

	err = myhostbridge_setup(bridge->bus);
	if (err)
		goto bridge_init_fail;

	/*
	 * Add flags here, this is just an example
	 */
	pci_add_flags(PCI_ENABLE_PROC_DOMAINS | PCI_COMPAT_DOMAIN_0);
	pci_add_flags(PCI_REASSIGN_ALL_BUS | PCI_REASSIGN_ALL_RSRC);

	bus_range.end = pci_scan_child_bus(bridge->bus);
	pci_bus_update_busn_res_end(bridge->bus, bus_range.end);

	pci_assign_unassigned_bus_resources(bridge->bus);
	pci_bus_add_devices(bridge->bus);

	return 0;

bridge_init_fail:
	kfree(pp);
	pci_free_resource_list(&resources);
	return err;
}
Best regards,
Liviu Dudau
[1] http://thread.gmane.org/gmane.linux.kernel.pci/25946
Liviu Dudau (1):
pci: Add support for creating a generic host_bridge from device tree
drivers/pci/host-bridge.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/probe.c | 11 ++++++
include/linux/pci.h | 14 ++++++++
3 files changed, 117 insertions(+)
--
1.8.5.3
Peter,
this patchset replaces the beginning of the previous mixed one, which was not
yet committed [1]. It is unchanged except for a compilation error fix for UP
kernel configs, and it is refreshed against tip/sched/core.
As the UP config compilation is broken in the previous patchset, git bisect is
no longer safe for it. You can apply this small series and drop the first 3
patches of the previous series [1], or ignore it and I will bring the fix
after you have applied [1].
It cleans up the idle_balance() function parameters by passing only the struct
rq, fixes a race in idle_balance() and finally moves the idle_stamp
from fair.c to core.c. I am aware it will move back to fair.c with Peter's
pending patches, but at least it changes the idle_balance() function to return
true if a balance occurred.
[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg577271.html
Changelog:
V2:
* fixed compilation errors when CONFIG_SMP=n
V1: initial post
Daniel Lezcano (3):
sched: Remove cpu parameter for idle_balance()
sched: Fix race in idle_balance()
sched: Move idle_stamp up to the core
kernel/sched/core.c | 13 +++++++++++--
kernel/sched/fair.c | 20 +++++++++++++-------
kernel/sched/sched.h | 8 +-------
3 files changed, 25 insertions(+), 16 deletions(-)
--
1.7.9.5
A double ! (!!) is normally required to get 0 or 1 out of an expression. A
comparison always returns 0 or 1, hence there is no need to apply a double !
to it again.
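To illustrate the point, here is a small user-space sketch (not kernel code).
THRESHOLD is a hypothetical stand-in for PM_SUSPEND_FREEZE and the function
names are made up for this example. It contrasts a comparison, which is
already 0 or 1, with a mask test, where !! does change the value:

```c
#include <assert.h>

/* Hypothetical threshold standing in for PM_SUSPEND_FREEZE. */
enum { THRESHOLD = 1 };

/* A relational operator already yields an int that is exactly 0 or 1. */
int compare_plain(int state)
{
	return state > THRESHOLD;
}

/* Applying !! on top of the comparison does not change the result. */
int compare_double_not(int state)
{
	return !!(state > THRESHOLD);
}

/*
 * !! earns its keep when the operand can be any nonzero value,
 * for example the result of a bit mask test.
 */
int flag_is_set(unsigned int flags, unsigned int mask)
{
	assert(mask != 0);
	return !!(flags & mask);
}
```

For any input, compare_plain() and compare_double_not() return the same
value, while flag_is_set() would return the raw masked bits without the !!.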
Signed-off-by: Viresh Kumar <viresh.kumar(a)linaro.org>
---
kernel/power/suspend.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 62ee437..90b3d93 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -39,7 +39,7 @@ static const struct platform_suspend_ops *suspend_ops;
 static bool need_suspend_ops(suspend_state_t state)
 {
-	return !!(state > PM_SUSPEND_FREEZE);
+	return state > PM_SUSPEND_FREEZE;
 }
 static DECLARE_WAIT_QUEUE_HEAD(suspend_freeze_wait_head);
--
1.7.12.rc2.18.g61b472e
On 23 January 2014 11:11, Lei Wen <adrian.wenl(a)gmail.com> wrote:
> On Wed, Jan 22, 2014 at 10:07 PM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
>> On Wed, 22 Jan 2014, Lei Wen wrote:
>>> Recently I have been experimenting with cpu isolation on a 3.10 kernel.
>>> But I find the isolated cpu is periodically woken up by an IPI.
>>>
>>> By checking the trace, I find those IPIs are generated by add_timer_on,
>>> which calls wake_up_nohz_cpu and wakes up the already idle cpu.
>>>
>>> With further checking, I find this timer is added by the on_demand governor of
>>> cpufreq. It periodically checks each core's state.
>>> The problem I see here is that cpufreq_governor uses INIT_DEFERRABLE_WORK
>>> as the tool, so the timer is deferrable anyway.
>>> What is more, the cpufreq checking is very frequent. In my case, the
>>> isolated cpu is woken up by an IPI every 5ms.
>>>
>>> So why does the kernel need to wake the remote processor when arming a
>>> deferrable timer? As per my understanding, we had better keep the cpu
>>> idle when using a deferrable timer.
>>
>> Indeed, we can avoid the wakeup of the remote cpu when the timer is
>> deferrable.
>
> Glad to hear that we could fix this unwanted wakeup.
> Do you have related patches already?
>
>>
>> Though you really want to figure out why the cpufreq governor is
>> arming timers on other cores every 5ms. That smells like an utterly
>> stupid approach.
>
> Not sure why cpufreq chooses such frequent profiling on each cpu.
> As I understand it, since the kernel is SMP, running the profiler on one cpu
> would be enough...
Hi Guys,
So the first question is why cpufreq needs it, and is it really stupid?
Yes, it is stupid, but that's how it has been implemented for a long time. It does
so to get data about the load on CPUs, so that the frequency can be scaled up/down.
There is a solution currently under discussion which will take
input from the scheduler, so these background timers would go away.
But we need to wait until then.
Now, why do we need that for every cpu, when doing it on a single cpu might
be enough? The answer is cpuidle: what if the cpu responsible for
running the timer goes to sleep? Who will evaluate the load then? And if we
make this timer run on one cpu in non-deferrable mode, then that cpu
would be woken up again and again from idle. So it was decided to have
a per-cpu deferrable timer. To improve efficiency, though, once it fires
on any cpu, the timers for all other CPUs are rescheduled so that they don't
fire before 5ms (the sampling time).
I think the diff below might fix this for you, though I am not sure if it
breaks something else. Probably Thomas/Frederic can answer here.
If this looks fine I will send it formally:
diff --git a/kernel/timer.c b/kernel/timer.c
index accfd24..3a2c7fa 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -940,7 +940,8 @@ void add_timer_on(struct timer_list *timer, int cpu)
 	 * makes sure that a CPU on the way to stop its tick can not
 	 * evaluate the timer wheel.
 	 */
-	wake_up_nohz_cpu(cpu);
+	if (!tbase_get_deferrable(timer->base))
+		wake_up_nohz_cpu(cpu);
 	spin_unlock_irqrestore(&base->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_timer_on);
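
For anyone who wants to play with the idea outside the kernel, here is a
rough user-space model of the change. It assumes, as the call
tbase_get_deferrable(timer->base) suggests, that the deferrable flag travels
with the base pointer; the model keeps it in bit 0. All names below
(timer_model, add_timer_on_model, wakeups, ...) are made up for this sketch
and are not the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; none of these are the kernel's real definitions. */
#define DEFERRABLE_FLAG 0x1u
#define NR_CPUS_MODEL 4

struct timer_base_model {
	int cpu;
};

struct timer_model {
	/* Base pointer with the deferrable flag stored in bit 0. */
	uintptr_t tagged_base;
};

/* Number of simulated wakeups (IPIs) sent to each CPU. */
int wakeups[NR_CPUS_MODEL];

void timer_init_model(struct timer_model *t, struct timer_base_model *base,
		      bool deferrable)
{
	/* The base must be at least 2-byte aligned so that bit 0 is free. */
	assert(((uintptr_t)base & DEFERRABLE_FLAG) == 0);
	t->tagged_base = (uintptr_t)base | (deferrable ? DEFERRABLE_FLAG : 0);
}

bool timer_is_deferrable(const struct timer_model *t)
{
	return (t->tagged_base & DEFERRABLE_FLAG) != 0;
}

/* Model of the proposed behaviour: no wakeup for deferrable timers. */
void add_timer_on_model(const struct timer_model *t, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	if (!timer_is_deferrable(t))
		wakeups[cpu]++;
}
```

Arming a deferrable timer on a CPU leaves its wakeups[] counter untouched,
while a normal timer increments it once per call. Whether skipping the wakeup
is safe in the real code is exactly the open question above.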
This patchset relies on the "setting the table for integration of cpuidle with
the scheduler" series from Nicolas Pitre, where the idle.c file has been moved
into the sched directory.
It encapsulates the cpuidle main code into three exported functions which are
used by the cpuidle_idle_call function. That function is then moved into the
idle.c file.
The third patch shows an example of how this makes integrating cpuidle
information into the scheduler easier.
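As a rough user-space illustration of the idea (not the code from the
patches), the sketch below splits an idle call into three steps and lets the
caller record the chosen state index in a per-runqueue structure. The
select/enter/reflect roles, the residency table and every name ending in
_model are assumptions made for this example:

```c
#include <assert.h>

#define NR_STATES 3

/* Toy idle-state table: minimum residency in microseconds per state. */
const unsigned int target_residency_us[NR_STATES] = { 0, 100, 1000 };

struct rq_model {
	int idle_state_idx;	/* -1 while the CPU is not idle */
};

/* Last state reported back by the reflect step, -1 if none yet. */
int last_entered = -1;

/* Step 1: choose the deepest state whose residency fits the expected sleep. */
int idle_select_model(unsigned int expected_sleep_us)
{
	int i, best = 0;

	for (i = 0; i < NR_STATES; i++)
		if (target_residency_us[i] <= expected_sleep_us)
			best = i;
	return best;
}

/* Step 2: enter the state; the model simply returns the state entered. */
int idle_enter_model(int idx)
{
	assert(idx >= 0 && idx < NR_STATES);
	return idx;
}

/* Step 3: report the outcome back; the model just records it. */
void idle_reflect_model(int idx)
{
	last_entered = idx;
}

/*
 * Idle call built from the three steps. Because the steps are separate,
 * the caller can publish the index in the rq around the enter step.
 */
void idle_call_model(struct rq_model *rq, unsigned int expected_sleep_us)
{
	int idx = idle_select_model(expected_sleep_us);

	rq->idle_state_idx = idx;
	idx = idle_enter_model(idx);
	rq->idle_state_idx = -1;
	idle_reflect_model(idx);
}
```

With the steps separated like this, code sitting next to the scheduler can
see which state a CPU is in without reaching into the cpuidle driver.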
Daniel Lezcano (3):
cpuidle: split cpuidle_idle_call main function into functions
cpuidle: move the cpuidle_idle_call function to idle.c
idle: store the idle state index in the struct rq
drivers/cpuidle/cpuidle.c | 80 +++++++++++++++++++++++++--------------------
include/linux/cpuidle.h | 9 +++--
kernel/sched/idle.c | 53 ++++++++++++++++++++++++++++++
kernel/sched/sched.h | 3 ++
4 files changed, 108 insertions(+), 37 deletions(-)
--
1.7.9.5