Paul,
I've been having some thoughts about CBuild and Lava and the TCWG integration of them both. I wish to share them and open them up for general discussion.
The background to this has been the flakiness of the Pandas (due to heat), the Arndales (due to board 'set-up' issues), and getting a batch of Calxeda nodes working.
The following discussion refers to building and testing only, *not* benchmarking.
If you look at http://cbuild.validation.linaro.org/helpers/scheduler you will see a bunch of calxeda01_* nodes have been added to CBuild. After a week of sorting them out they provide builds twice as fast as the Panda boards. However, during the setup of the boards I came to the conclusion that we set build slaves up incorrectly, and that there is a better way.
The issues I encountered were:
* The Calxedas run quantal - yet we want to build on precise.
* It's hard to get a machine running hard-float to bootstrap a soft-float compiler, and vice versa.
* My understanding of how the Lava integration works is that it runs the cbuild install scripts each time, so we can't necessarily reproduce a build if the upstream packages have changed.
Having thought about this a bit I came to the conclusion that the simple solution is to use chroots (managed by schroot), and to change the architecture a bit. In the old architecture everything is put into the main file-system as one layer. The new architecture would split this into two:
1. Rootfs - contains just enough to boot the system and knows how to download an appropriate chroot and start it.
2. Chroots - these contain a build system set up for a particular kind of build.
The rootfs can be machine-type specific (as necessary), and for builds can be a stock Linaro root filesystem. It will contain scripts to set up the required users, and then to download an appropriate chroot and run it.
The chroot will be set up for a particular type of build (soft-float vs hard-float) and will be the same for all platforms. The advantage of this is that I can then download a chroot to my ChromeBook and reproduce a build locally in the same environment to diagnose issues.
The Calxeda nodes in cbuild use this type of infrastructure - the rootfs is running quantal (and I have no idea how it is configured - it is what Steve supplied me with). Each node then runs two chroots (precise armel and precise armhf) which take it in turns to ask the cbuild scheduler whether there is a job available.
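To make the chroot idea concrete, here's a minimal sketch of what one build chroot's schroot entry and launch might look like (the names, paths and URL are purely illustrative, not what's actually deployed on the Calxeda nodes):

    # /etc/schroot/chroot.d/precise-armhf -- hypothetical schroot entry
    [precise-armhf]
    description=Ubuntu precise armhf build chroot
    type=directory
    directory=/srv/chroots/precise-armhf
    users=cbuild

    # Fetch and unpack the chroot, then run a build job inside it
    wget http://example.org/chroots/precise-armhf.tar.gz        # placeholder URL
    sudo mkdir -p /srv/chroots/precise-armhf
    sudo tar -xzf precise-armhf.tar.gz -C /srv/chroots/precise-armhf
    schroot -c precise-armhf -- ./ask-cbuild-for-a-job.sh      # hypothetical poll script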
So my first question is: does any of the above make sense?
Next steps as I see it are:
1. Paul/Dave - what stage is the work to get the Pandaboards in the Lava farm cooled at? One advantage of the above architecture is that we could use a stock Pandaboard kernel & rootfs with thermal limiting turned on for builds, so that things don't fall over all the time.
2. Paul - how hard would it be to try and fire up a Calxeda node into Lava? We can use one of the ones assigned to me. I don't need any of the fancy multinode stuff that Michael Hudson-Doyle is working on - each node can be considered a separate board. I feel guilty that I put the nodes into CBuild without looking at Lava - it was easier to do and got me going - but I think correcting that is important.
3. Generally - What's the state of the Arndale boards in Lava? Fathi has got GCC building reliably, although I believe he is now facing networking issues.
4. Paul - If Arndale boards are available in Lava - how much effort would it be to make them available to CBuild?
One issue the above doesn't solve as far as I see it is being able to say to Lava that we can do a build on any ARMv7-A CBuild compatible board. I don't generally care whether the build happens on an Arndale, Panda, or Calxeda board - I want the result in the shortest possible time.
A final note on benchmarking. I think the above scheme could work for benchmarking targets too; all we need to do is build a kernel/rootfs that is set up to provide a system that produces repeatable benchmarking results.
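As a rough illustration only, the sort of thing such a benchmarking rootfs would pin down at boot might look like this (generic settings, not a description of any existing CBuild image):

    # Fix the CPU frequency so ondemand/thermal scaling doesn't skew results
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
        echo performance | sudo tee $cpu/cpufreq/scaling_governor
    done
    # Disable address-space randomisation for more repeatable runs
    echo 0 | sudo tee /proc/sys/kernel/randomize_va_space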
Comments welcome from all.
Thanks,
Matt
On Tue, Apr 16, 2013 at 10:49:23AM +0100, Matthew Gretton-Dann wrote:
- Paul - how hard would it be to try and fire up a Calxeda node
into Lava? We can use one of the ones assigned to me. I don't need any fancy multinode stuff that Michael Hudson-Doyle is working on - each node can be considered a separate board. I feel guilty that I put the nodes into CBuild without looking at Lava - but it was easier to do and got me going - I think correcting that is important
Support for the Calxeda nodes is being worked on (at code review stage), and as you would expect that's orthogonal to multi-node testing. It should land soonish.
One issue the above doesn't solve as far as I see it is being able to say to Lava that we can do a build on any ARMv7-A CBuild compatible board. I don't generally care whether the build happens on an Arndale, Panda, or Calxeda board - I want the result in the shortest possible time.
Good point, right now you have to explicitly ask for some device type ... but if you want the quickest response, your best bet is to submit to the faster devices. :-)
On 16 April 2013 12:37, Antonio Terceiro antonio.terceiro@linaro.org wrote:
Good point, right now you have to explicitly ask for some device type ... but if you want the quickest response, your best bet is to submit to the faster devices. :-)
This is not the point, I think.
For toolchain testing, the specific CPU matters less than for kernel testing. Even less important is which particular board revision or flavour. If the build system is smart and can figure out which CPU it's running on (most can), it should make no difference whether we run builds on dual-A9, quad-A9 or even A15, as long as everything builds and passes the tests.
For instance, fixing Panda-ES on LAVA means I'll wait in a long queue, because there are only a few of them, while there are 15 of the old Pandas sitting idle all the time. They might be slower, but it's much quicker to get results from them than to wait for an ES to free up.
In the past, I have used a language that describes system properties to reserve boards (like "A9 & NEON & RAM >= 1GB") that would give me a list of available boards, from which I'd choose one based on my own criteria. So, if you know how long it usually takes to build on boards X, Y and Z, and you have a list of the jobs waiting on each of them, with their own average build times, you can estimate which board will be freed first, and list the boards sorted in that order. I could then pick the one I think is best and add my build to that board's queue.
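As a very rough sketch of that estimate, assuming a hypothetical text file listing each board's average build time and current queue depth (this doesn't reflect any existing CBuild or LAVA interface):

    # boards.txt: <board> <avg_build_secs> <jobs_queued>   (hypothetical data)
    # Estimated seconds until each board drains its queue, soonest first
    awk '{ printf "%-12s %d\n", $1, $2 * $3 }' boards.txt | sort -k2 -n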
With the number of different boards going up and the total number of boards in the racks also going up, including virtual machines, I assume this will save a lot of time in the future, even though it looks quite daunting right now to implement.
cheers, --renato
PS: I've used this system fully automatically for our regression tests, in parallel with many developers and benchmarks running at the same time, and it worked a charm.
On 16 April 2013 13:19, Renato Golin renato.golin@linaro.org wrote:
In the past, I have used a language that describes system properties to reserve boards (like "A9 & NEON & RAM >= 1GB") that would give me a list of available boards, when I'd choose one based on my own criteria.
The trouble with this approach (as you may be aware :-)) is that if the board farm includes a few 'rare' board types that happen to be covered by a broad system-property criterion used by most people, it can be tricky to schedule jobs which really require the 'rare' board type, because the rare resource can get monopolised by a big job which could have run on anything but happened to get scheduled to the rare board because it was temporarily free. This is particularly acute if the rare board is also a rather slow one.
-- PMM
On 16 April 2013 13:28, Peter Maydell peter.maydell@linaro.org wrote:
The trouble with this approach (as you may be aware :-)) is that if the board farm includes a few 'rare' board types that happen to be covered by a broad system-property criterion used by most people, it can be tricky to schedule jobs which really require the 'rare' board type, because the rare resource can get monopolised by a big job which could have run on anything but happened to get scheduled to the rare board because it was temporarily free. This is particularly acute if the rare board is also a rather slow one.
There are a number of ways you can overcome this, for example:
* by not listing this particular board by components or configurations, but solely by name, so it can only be scheduled by specific jobs that call it by name;
* by adding a huge weight to it, making it always fall to the bottom of most lists and only show up when you search so specifically that only that board appears.
There are other problems too, and they can be dealt with reasonably quickly, but validating each one is not a trivial task and gets incrementally harder. I'm not claiming this should be top priority, just that it's a possible future we might want to be in. ;)
cheers, --renato
On 16.04.2013 11:49, Matthew Gretton-Dann wrote:
The issues I encountered were:
- Its hard to get a machine running in hard-float to bootstrap a soft-float
compiler and vice-versa.
hmm, why?
When using precise or quantal as the build environment, having these packages installed should be good enough:
libc6-dev-armhf [armel], libc6-dev-armel [armhf] binutils g++-multilib
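For instance, on an armhf precise/quantal host that might look like the following (just a sketch of the suggestion above; the exact float-abi flag depends on how the armel multilib was configured):

    # Pull in the soft-float libc and a multilib-capable g++
    sudo apt-get install libc6-dev-armel binutils g++-multilib
    # List the multilib variants this gcc actually provides
    gcc -print-multi-lib
    # Then a soft-float test build should work with the matching flag, e.g.
    echo 'int main(void) { return 0; }' > hello.c
    gcc -mfloat-abi=softfp -o hello-sf hello.c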
Although I still have a local patch to support the multilib configuration:
http://anonscm.debian.org/viewvc/gcccvs/branches/sid/gcc-4.8/debian/patches/...
Matthias
On 16/04/13 14:08, Matthias Klose wrote:
On 16.04.2013 11:49, Matthew Gretton-Dann wrote:
The issues I encountered were:
- Its hard to get a machine running in hard-float to bootstrap a soft-float
compiler and vice-versa.
hmm, why?
when using precise or quantal as the build environment, then having these packages installed should be good enough:
libc6-dev-armhf [armel], libc6-dev-armel [armhf] binutils g++-multilib
Although I still have a local patch to support the multilib configuration:
http://anonscm.debian.org/viewvc/gcccvs/branches/sid/gcc-4.8/debian/patches/...
I honestly don't know what the issue is - except that when I try to bootstrap a vanilla FSF GCC arm-none-linux-gnueabi with the initial host compiler being arm-none-linux-gnueabihf I get failures during the library builds in stage 1.
Also, given that we try to build vanilla compilers, and for 4.6 & 4.7 that requires fiddling with links in /usr/lib and /usr/include to point into the multiarch stuff, doing this in a chroot is safer than doing it on the main system.
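Roughly the kind of link fiddling I mean, inside a throwaway chroot (illustrative paths only, for the hard-float case - and, as noted below, it shouldn't be needed on multiarch-aware releases):

    # Expose the multiarch startup files, libc and headers where a
    # pre-multiarch vanilla GCC expects to find them
    ln -s /usr/lib/arm-linux-gnueabihf/crt*.o /usr/lib/
    ln -s /usr/lib/arm-linux-gnueabihf/libc.so /usr/lib/
    ln -s /usr/include/arm-linux-gnueabihf/* /usr/include/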
Thanks,
Matt
On 16.04.2013 15:46, Matthew Gretton-Dann wrote:
On 16/04/13 14:08, Matthias Klose wrote:
On 16.04.2013 11:49, Matthew Gretton-Dann wrote:
The issues I encountered were:
- Its hard to get a machine running in hard-float to bootstrap a soft-float
compiler and vice-versa.
hmm, why?
when using precise or quantal as the build environment, then having these packages installed should be good enough:
libc6-dev-armhf [armel], libc6-dev-armel [armhf] binutils g++-multilib
Although I still have a local patch to support the multilib configuration:
http://anonscm.debian.org/viewvc/gcccvs/branches/sid/gcc-4.8/debian/patches/...
I honestly don't know what the issue is - except that when I try to bootstrap a vanilla FSF GCC arm-none-linux-gnueabi with the initial host compiler as arm-none-linux-gnueabihf I get failures during libraries builds in stage 1.
Also given that we try to build vanilla compilers, and so for 4.6 & 4.7 that requires fiddling with links in /usr/lib and /usr/include to point into the multiarch stuff, doing this in a chroot is safer than on the main system.
This is not true. AFAICS all of the active GCC Linaro releases have the multiarch patches merged from upstream, so knowing the root cause would be better than tampering with the links.
Matthias
Hello Matt,
There were quite a few responses already, so I'll try to focus on the questions to which I think I may contribute something useful.
On Tue, 16 Apr 2013 10:49:23 +0100 Matthew Gretton-Dann matthew.gretton-dann@linaro.org wrote:
Paul,
I've been having some thoughts about CBuild and Lava and the TCWG integration of them both. I wish to share them and open them up for general discussion.
The background to this has been the flakiness of the Panda's (due to heat), the Arndale (due to board 'set-up' issues), and getting a batch of Calxeda nodes working.
The following discussion refers to building and testing only, *not* benchmarking.
If you look at http://cbuild.validation.linaro.org/helpers/scheduler you will see a bunch of calxeda01_* nodes have been added to CBuild. After a week of sorting them out they provide builds twice as fast as the Panda boards. However, during the setup of the boards I came to the conclusion that we set build slaves up incorrectly, and that there is a better way.
The issues I encountered were:
- The Calxeda's run quantal - yet we want to build on precise.
- Its hard to get a machine running in hard-float to bootstrap a
soft-float compiler and vice-versa.
- My understanding of how the Lava integration works is that it
runs the cbuild install scripts each time, and so we can't necessarily reproduce a build if the upstream packages have been changed.
Having thought about this a bit I came to the conclusion that the simple solution is to use chroots (managed by schroot), and to change the architecture a bit. The old architecture is everything is put into the main file-system as one layer. The new architecture would be to split this into two:
- Rootfs - Contains just enough to boot the system and knows how
to download an appropriate chroot and start it. 2. Chroots - these contain a setup build system that can be used for particular builds.
The rootfs can be machine type specific (as necessary), and for builds can be a stock linaro root filesystem. It will contain scripts to set the users needed up, and then to download an appropriate chroot and run it.
The chroot will be set up for a particular type of build (soft-float vs hard-float) and will be the same for all platforms. The advantage of this is that I can then download a chroot to my ChromeBook and reproduce a build locally in the same environment to diagnose issues.
The Calxeda nodes in cbuild use this type of infrastructure - the rootfs is running quantal (and I have no idea how it is configured - it is what Steve supplied me with). Each node then runs two chroots (precise armel and precise armhf) which take it in turns to ask the cbuild scheduler whether there is a job available.
So my first question is does any of the above make sense?
If you propose that LAVA builds use such a chroot setup, then it technically should be possible, but practically it will be quite a chore to set up and maintain. If we want to use LAVA, why don't we follow its way directly? It already allows any rootfs to be used (and switched easily) directly, and there should be distro methods to pin packages to specific versions. If you want to run LAVA's rootfs in a chroot on a Chromebook, you can do just that - take one and transform it into a chroot (the "transform" stage may take a bit of effort initially, but as the LAVA rootfs is wholly based on Linaro's standard linaro-media-create technology, once done it's reusable for all Linaro builds).
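For example, the "transform" step could be as simple as something like this (the file name is a placeholder, and the tarball layout may differ, e.g. the rootfs may unpack into a binary/ subdirectory):

    # Unpack a Linaro developer rootfs tarball and enter it as a chroot
    mkdir rootfs
    sudo tar --numeric-owner -xzf linaro-precise-developer.tar.gz -C rootfs
    sudo mount -t proc proc rootfs/proc
    sudo mount --bind /dev rootfs/dev
    sudo chroot rootfs /bin/bash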
Next steps as I see it are:
- Paul/Dave - what stage is getting the Pandaboards in the Lava
farm cooled at? One advantage of the above architecture is we could use a stock Pandaboard kernel & rootfs that has thermal limiting turned on for builds, so that things don't fall over all the time.
I'm currently focusing on critical android-build issues, so anything else is in the backlog. And next up in my queue is supporting IT with the global Linaro services EC2 migration ;-I.
But the problem we have is not that we can't get reliable *builds* in LAVA - it's that the *complete* CBuild picture doesn't work in LAVA. Benchmarking specifically is the culprit. If you want reliable builds, just use the "lava-panda-usbdrive" queue - that will use those 15 standard Panda boards mentioned by Renato, with a known-good rootfs/kernel. The problem is that the gcc etc. binaries produced by those builds won't run on the benchmarking image, because the OS versions of the "known good Panda rootfs" and the "validated CBuild PandaES rootfs" are different.
- Paul - how hard would it be to try and fire up a Calxeda node
into Lava?
As other folks have answered, that depends entirely on the work which the (old-time) LAVA people are doing, and is not something I (a former Infra engineer) can influence so far.
We can use one of the ones assigned to me. I don't need any fancy multinode stuff that Michael Hudson-Doyle is working on - each node can be considered a separate board. I feel guilty that I put the nodes into CBuild without looking at Lava - but it was easier to do and got me going - I think correcting that is important
- Generally - What's the state of the Arndale boards in Lava?
Fathi has got GCC building reliably, although I believe he is now facing networking issues.
- Paul - If Arndale boards are available in Lava - how much effort
would it be to make them available to CBuild?
The next step is getting a good rootfs for it. Note that if a board is *supported* by LAVA, there's always at least one good rootfs - the one used by LAVA itself as the "meta rootfs" (the one a board gets booted into in order to configure the target rootfs). Then it's just a matter of switching the board type in the job template and fixing the bugs we see in the first builds. Summing up: it should be easy, modulo any unexpected things that may happen along the way.
One issue the above doesn't solve as far as I see it is being able to say to Lava that we can do a build on any ARMv7-A CBuild compatible board. I don't generally care whether the build happens on an Arndale, Panda, or Calxeda board - I want the result in the shortest possible time.
LAVA has flexible tag functionality which allows almost all of the things mentioned by Renato. But that's a theoretical point; practically, there will always be differences and issues, so having an affinity to a particular board at a given time makes sense (while migrating in an agile manner as "better" boards become available).
A final note on benchmarking. I think the above scheme could work for benchmarking targets all we need to do is build a kernel/rootfs that is setup to provide a system that produces repeatable benchmarking results.
Yes, we only need to make a kernel/rootfs which is suitable for the boards (and their particular setup, like cooling issues) we have at hand. It seems that this is exactly what can't be achieved easily, because no single team has enough data/experience. For example, TCWG knows how to set up a rootfs for toolchain builds, but doesn't know what a good rootfs/kernel combo is for a particular board. QA Services would know (at least I'd hope so!) which basic rootfs/kernel is good for a board, but can't provide a CBuild-ready one, and we don't even involve them in the discussion. Infra/LAVA kind of runs this stuff, and would be happy to prepare/deploy the needed image given specific data and instructions, but without those it would be painfully slow, because we'd need to re-investigate all of this, hitting lots of dead ends, and being sidetracked by other work regularly.
But besides that, one important point is that building a new kernel/rootfs pair for benchmarking means invalidating all of the previous results (or introducing a discontinuity). Are you ready to do that? I have always treated the answer as an invariant "no" for my work on CBuild/LAVA, but sooner or later it will need to be done anyway. If it can be done now (for LAVA builds), then let's just switch to using plain Pandas right away, be ready to switch to Arndale/Calxeda as they become stable - and let the matter of preparing a new benchmark image run in the background.
Comments welcome from all.
Thanks,
Matt
Hello,
On Wed, 17 Apr 2013 06:44:56 +0300 Paul Sokolovsky Paul.Sokolovsky@linaro.org wrote:
[]
But the problem we have is not that we can't get reliable *builds* in LAVA - it's that the *complete* CBuild picture doesn't work in LAVA. Benchmarking specifically is the culprit. If you want reliable builds, just use the "lava-panda-usbdrive" queue - that will use those 15 standard Panda boards mentioned by Renato, with a known-good rootfs/kernel. The problem is that the gcc etc. binaries produced by those builds won't run on the benchmarking image, because the OS versions of the "known good Panda rootfs" and the "validated CBuild PandaES rootfs" are different.
Ok, this discussion appears to have been backlogged in the release rush, and may be forgotten for some time after it, so I'm proceeding with the proposed intermediate solution - switching the daily gcc builds to the "lava-panda-usbdrive" queue. That can't do much harm, as the lava-pandaes-usbdrive results are not usable at all.
I also see that gcc-4.9 builds take even longer, so I have increased the LAVA timeout for them.
[]