Hi Folks,
I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that will tell me the current CPU temperature and the allowed maximum, and when the bot passes 90%, it shuts itself off.
The problem is that I'm running with heat-sinks and the boards are on top of three fans, so there really isn't much more I can do to solve this problem.
I personally think this is a hardware problem, since everything is on the same die, CPU, GPU and RAM, and the physical dimensions of the chip are quite small. I remember when Intel started overheating (around 486DX66) and the die was huge (more heat dissipation), plus RAM and GPU were separate, and it still needed a hefty heat-sink.
It's true that gates are far smaller today, but it's not true that a dual core 1.3GHz + GPU + RAM will produce less heat on a small die than a 66MHz CPU on a huge die, so why anyone thinks it's a good idea to release a 1+GHz chip without *any* form of heat dissipation is beyond my comprehension.
Manufacturers only got away with it, so far, because people rarely use 100% of the CPU power for extended periods of time, because ARM devices end up as set-top boxes, mobile phones and tablets. However, even those devices will heat up when playing 2 h films or games, and they do have some form of heat sink.
We, at the toolchain group, make things worse by using 100% CPU, 24 / 7, something that Panda boards or Arndales were not designed to do. However, with ARM moving into the server space, their designs will have to be re-thought, and what better place than Linaro for making sure we get it right?
For the time being, I believe we *must* have air conditioning in the Lab all the time, and we *must* have heat-sinks on every board, and we *must* monitor the CPU temperature of the boards, at least until we're comfortable that they're not failing all the time.
Can we make a temperature monitor (like the one attached) a default feature on Linaro Ubuntu distributions? We could dump that info to the syslog/dmesg whenever it crosses the (say) 75% threshold, and report more often when it crosses 95%, possibly dumping the process(es) that are consuming the most CPU at the time, to enable post-mortem debugging.
cheers, --renato
As a side note, the quad-A9 ODroid does ship with a massive heat-sink, which also serves as a fancy case. Quite clever, really.
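A minimal sketch of the kind of monitor proposed above, assuming the kernel exposes the temperature through the generic Linux thermal sysfs interface. The paths, the function names and the use of the first trip point as the "allowed maximum" are assumptions for illustration and may not match the attached monitor or a given Panda kernel; the 75% and 95% thresholds are the ones suggested in the message.

```python
from pathlib import Path

# Assumed sysfs locations (generic Linux thermal framework); the actual
# files on a given Panda kernel may differ.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
MAX_PATH = Path("/sys/class/thermal/thermal_zone0/trip_point_0_temp")


def read_millicelsius(path):
    """Read an integer temperature in millidegrees Celsius from a sysfs file."""
    return int(Path(path).read_text().strip())


def thermal_level(current, maximum, warn=0.75, critical=0.95):
    """Classify a reading as a fraction of the allowed maximum."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    ratio = current / maximum
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warn"
    return "ok"


def format_report(current, maximum):
    """Build a one-line message suitable for writing to syslog."""
    percent = 100.0 * current / maximum
    return "cpu temp %.1fC (%.0f%% of max, %s)" % (
        current / 1000.0, percent, thermal_level(current, maximum))
```

A polling loop could call read_millicelsius() on both paths every few seconds and log format_report() whenever the level is not "ok", shortening the interval once it reaches "critical".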
I believe that in the LAVA lab there are a few pandas with USB keys that are used for builds to try and overcome some reliability problems. Don't know if it was a temperature problem or something else. With any luck someone who knows more about that issue can speak up and share what they found. You could also try running "stress --cpu 4 --vm 2" and see if any errors show. I find that on my desktop running 2x the number of CPU stress threads as I have CPUs is about right to eat all available resources. That will just stress RAM and CPU, not disk I/O, which should pinpoint the problem. Plenty of other options (http://www.hecticgeek.com/2012/11/stress-test-your-ubuntu-computer-with-stre...)...
Is running at 100% of the thermal limit really an issue? Isn't the point that it is the limit, which itself should have some safety built in? I don't know off hand if the OMAP 4 SoCs incorporate hardware frequency limiting or if it is entirely software, in which case the kernel frequency governor should (at a guess) be throttling back.
I did have a panda give up on me about a year ago. It wasn't being worked hard, but did refuse to get through a boot most of the time (it did power on and get part way through booting). Those boards aren't designed for high reliability and it may be that you just need to get a couple of replacements.
James
On 3 July 2013 15:42, James Tunnicliffe james.tunnicliffe@linaro.org wrote:
I believe that in the LAVA lab there are a few pandas with USB keys that are used for builds to try and overcome some reliability problems.
I'm using USB drives for that reason.
Is running at 100% of the thermal limit really an issue? Isn't the point that it is the limit, which itself should have some safety built in? I don't know off hand if the OMAP 4 SoCs incorporate hardware frequency limiting or if it is entirely software, in which case the kernel frequency governor should (at a guess) be throttling back.
That's what I thought, but apparently, both Panda and Panda ES on current Linaro Ubuntu 13.03 fail randomly with USB drives (SSD or HDD) after a few hours under constant load. That means it's impossible for me to use them for toolchain testing at all. Arndales have also given up after a few hours, though after the errata kernel patches it was a bit better.
The only board that hasn't failed yet is the Chromebook, which has clocked a solid 5-month period under intense load. Guess what? The Chromebook's A15, which is identical to the Arndale's, has a massive heat-sink almost the size of the laptop itself.
I did have a panda give up on me about a year ago. It wasn't being worked hard, but did refuse to get through a boot most of the time (it did power on and get part way through booting). Those boards aren't designed for high reliability and it may be that you just need to get a couple of replacements.
I have tried 5 different Pandas and all of them fail the same way. I don't think it's a matter of replacing defective boards, but of trying a different board altogether...
cheers, --renato
On 3 July 2013 14:13, Renato Golin renato.golin@linaro.org wrote:
Hi Folks,
I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that will tell me the current CPU temperature and the allowed maximum, and when the bot passes 90%, it shuts itself off.
The problem is that I'm running with heat-sinks and the boards are on top of three fans, so there really isn't much more I can do to solve this problem.
I personally think this is a hardware problem, since everything is on the same die, CPU, GPU and RAM, and the physical dimensions of the chip are quite small. I remember when Intel started overheating (around 486DX66) and the die was huge (more heat dissipation), plus RAM and GPU were separate, and it still needed a hefty heat-sink.
It's true that gates are far smaller today, but it's not true that a dual core 1.3GHz + GPU + RAM will produce less heat on a small die than a 66MHz CPU on a huge die, so why anyone thinks it's a good idea to release a 1+GHz chip without *any* form of heat dissipation is beyond my comprehension.
Modern silicon processes are much more power-efficient than those of the 90s. For example, an old ~500MHz Alpha machine I have readily consumes 90W even when idle. A quad-core Intel i7 typically has a TDP of 130W at full load. That's orders of magnitude more gates clocked at 6x the frequency and still using only marginally more power.
BTW, the RAM is a separate chip mounted on top of the SoC.
Manufacturers only got away with it, so far, because people rarely use 100% of the CPU power for extended periods of time, because ARM devices end up as set-top boxes, mobile phones and tablets. However, even those devices will heat up when playing 2 h films or games, and they do have some form of heat sink.
An OMAP4460 will run at 1.2GHz indefinitely without overheating in reasonable ambient temperature. The higher frequencies are only meant to be used in conjunction with (software) thermal management to throttle back if temperature rises.
If you don't have thermal management in the kernel you're running, you need to clamp the clock at a safe value.
On 3 July 2013 15:59, Mans Rullgard mans.rullgard@linaro.org wrote:
Modern silicon processes are much more power-efficient than those of the 90s. For example, an old ~500MHz Alpha machine I have readily consumes 90W even when idle. A quad-core Intel i7 typically has a TDP of 130W at full load. That's orders of magnitude more gates clocked at 6x the frequency and still using only marginally more power.
I don't remember the numbers exactly, but the DX Intel machines weren't that power-hungry. Here[1] I read they used 600mA on a 5V input, which gives you 3W consumption, and it already had a heat-sink. ;)
An OMAP4460 will run at 1.2GHz indefinitely without overheating in reasonable ambient temperature.
Probably in Sweden, "room temperature" is -10... ;)
But running at 1.2GHz doesn't mean it will be using the whole system, RAM and GPU included, which being on the same SoC, contribute to the overall temperature. I've seen some GPU errors on the syslog, not sure it's related to the failures, or caused by them.
If you don't have thermal management in the kernel you're running, you need to clamp the clock at a safe value.
I'd expect that Linaro's kernel on Ubuntu 13.03 already had a decent thermal control of the Panda. I can get the temperatures without special code, so I assume the kernel knows precisely what to do, and I also hope that the kernel can do scheduling, otherwise, what's the point of measuring temperatures...
But more to the point, I don't want to be scaled down when hot, I want it never to get hot in the first place, so I can run at full 1.2GHz, 24 / 7. If the scheduler reduces the frequency to decrease the temperature, I'll be testing more commits per run AND my benchmarks will be skewed, depending on room temperature, which is the same as to say they're not benchmarks at all.
cheers, --renato
On 3 July 2013 16:48, Renato Golin renato.golin@linaro.org wrote:
On 3 July 2013 15:59, Mans Rullgard mans.rullgard@linaro.org wrote:
An OMAP4460 will run at 1.2GHz indefinitely without overheating in reasonable ambient temperature.
If you don't have thermal management in the kernel you're running, you need to clamp the clock at a safe value.
I'd expect that Linaro's kernel on Ubuntu 13.03 already had a decent thermal control of the Panda. I can get the temperatures without special code, so I assume the kernel knows precisely what to do, and I also hope that the kernel can do scheduling, otherwise, what's the point of measuring temperatures...
But more to the point, I don't want to be scaled down when hot, I want it never to get hot in the first place, so I can run at full 1.2GHz, 24 / 7. If the scheduler reduces the frequency to decrease the temperature, I'll be testing more commits per run AND my benchmarks will be skewed, depending on room temperature, which is the same as to say they're not benchmarks at all.
I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management. 1.4GHz and higher _does_ require active thermal management, and I would not assume that a random kernel has this feature enabled merely because it can report the temperature.
If you want to run benchmarks on this chip, you must do so at no higher than 1.2GHz. The chip is designed for phones/tablets where high CPU load typically only occurs in short bursts.
On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:
I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.
My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.
linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000
Now what?
cheers, --renato
On 3 July 2013 17:41, Renato Golin renato.golin@linaro.org wrote:
On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:
I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.
My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.
linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000
Now what?
Are you using the same set up as the LAVA lab in terms of OS, kernel, software versions? If the Cbuild/LAVA boards run reliably (and I don't think they have any direct cooling or a heatsink on them), then that is a useful place to start.
-- James Tunnicliffe
On 3 July 2013 17:41, Renato Golin renato.golin@linaro.org wrote:
On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:
I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.
My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.
linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000
Now what?
4430 max frequency is 1.0GHz unless I'm mistaken. Either way, try reducing your clock to 1.0GHz and see what happens.
On 3 July 2013 18:08, Mans Rullgard mans.rullgard@linaro.org wrote:
4430 max frequency is 1.0GHz unless I'm mistaken. Either way, try reducing your clock to 1.0GHz and see what happens.
Yes, I meant 4430 and 4460 at their natural high frequencies.
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
350000 700000 920000 1200000
I'll set the max to 920MHz on the scaling and let's see how it goes...
cheers, --renato
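As a rough illustration of capping the clock as described above, the following sketch picks the highest listed frequency at or below a limit and writes it to scaling_max_freq. The function names are hypothetical; the sysfs files are the ones quoted in this thread, only cpu0's policy is touched, and the write needs root.

```python
from pathlib import Path

# Directory quoted in the thread; other cores are not handled here.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def parse_frequencies(text):
    """Parse the space-separated kHz list from scaling_available_frequencies."""
    return sorted(int(field) for field in text.split())


def highest_at_or_below(available, limit_khz):
    """Return the highest available frequency not exceeding limit_khz."""
    candidates = [freq for freq in available if freq <= limit_khz]
    if not candidates:
        raise ValueError("no available frequency at or below %d kHz" % limit_khz)
    return max(candidates)


def clamp_max_frequency(limit_khz, cpufreq_dir=CPUFREQ_DIR):
    """Write the chosen cap to scaling_max_freq (needs root) and return it."""
    cpufreq_dir = Path(cpufreq_dir)
    available = parse_frequencies(
        (cpufreq_dir / "scaling_available_frequencies").read_text())
    chosen = highest_at_or_below(available, limit_khz)
    (cpufreq_dir / "scaling_max_freq").write_text("%d\n" % chosen)
    return chosen
```

With the frequency list shown above, clamp_max_frequency(1000000) would select 920000.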
On 03/07/13 17:41, Renato Golin wrote:
On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:
I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.
My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.
linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000
Now what?
keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.
Remember that manufacturers match the form of packaging to the expected TDP of the intended usage environment (to keep product costs down). In a mobile part that probably means relatively cheap plastic package because a hot chip would burn a hole in your pocket -- literally. The package almost certainly doesn't have a high thermal conductivity from the chip to the external surface so while a heat sink might help, it won't be as effective as with other packaging options.
Chips expected to dissipate large amounts of power normally have a metal pad on the package so that a heat sink with thermal grease will make a good thermal contact.
R.
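The "keep lowering until stable" procedure above can be sketched as a simple top-down search. Here is_stable is a hypothetical callback (for example: cap the clock, run a long build, report whether the board survived); it is not an existing tool.

```python
def find_stable_frequency(available, is_stable):
    """Try frequencies from highest to lowest; return the first stable one.

    Returns None when even the lowest frequency fails, which, following the
    reasoning above, suggests the problem is not heat-related.
    """
    for freq in sorted(available, reverse=True):
        if is_stable(freq):
            return freq
    return None
```

Each is_stable call may take hours in practice, so walking down from the top keeps the number of long runs small when the highest settings are the ones that fail.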
On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:
Chips expected to dissipate large amounts of power normally have a metal pad on the package so that a heat sink with thermal grease will make a good thermal contact.
This is a really good point. The heat-sink does get really hot, but what matters is not the final temperature but the speed at which heat dissipates through the plastic to the heat-sink during peak usage, and plastic sucks at thermal conductivity.
Let's see how it behaves at 920MHz...
I wonder if the ODroid heat-sink, which is bigger than the board itself, is really that effective, or just more of a vanity item. The Arndale has a metallic case, and I could fit a north-bridge heat-sink on it, which is bigger than the RAM heat-sink I put on the Pandas, and after the errata fix, they did behave properly at full speed.
cheers, --renato
On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:
On 03/07/13 17:41, Renato Golin wrote:
On 3 July 2013 17:22, Mans Rullgard mans.rullgard@linaro.org wrote:
I repeat, the 4460 will run at 1.2GHz indefinitely without thermal management.
My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it freezes every few hours on full load on both 4430 and 4460.
linaro@linaro-panda-01:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
1200000
Now what?
keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.
Remember that manufacturers match the form of packaging to the expected TDP of the intended usage environment (to keep product costs down). In a mobile part that probably means relatively cheap plastic package because a hot chip would burn a hole in your pocket -- literally. The package almost certainly doesn't have a high thermal conductivity from the chip to the external surface so while a heat sink might help, it won't be as effective as with other packaging options.
Chips expected to dissipate large amounts of power normally have a metal pad on the package so that a heat sink with thermal grease will make a good thermal contact.
The PoP RAM also complicates cooling.
On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:
keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.
It might be a bit too soon, but I just got a few 7h builds out of the boards at 920MHz without a single glitch, whereas before, they wouldn't run for more than 4h in a row. Both boards have been running non-stop since 8pm yesterday.
I'll keep them running during Connect in the exact same place as they are now (and were before), just to be sure, but I'm still betting that they cannot run at 1.2GHz at full steam for long periods without some serious cooling.
cheers, --renato
On 4 July 2013 12:27, Renato Golin renato.golin@linaro.org wrote:
On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:
keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.
It might be a bit too soon, but I just got a few 7h builds out of the boards at 920MHz without a single glitch, whereas before, they wouldn't run for more than 4h in a row. Both boards have been running non-stop since 8pm yesterday.
I'll keep them running during Connect in the exact same place as they are now (and were before), just to be sure, but I'm still betting that they cannot run at 1.2GHz at full steam for long periods without some serious cooling.
Faster clocks also drink more power - now you are using slower clocks those boards will be stressing the PSU less. There are plenty of other components on the board, any of which could be causing the problem, including the peripherals that you plugged in.
If you don't have another 5V PSU to try but do have a spare ATX PSU then it isn't difficult to hook up the 0 and 5V rails from a molex connector. May be worth a go. You could easily run all the boards from 1 ATX PSU. (short pin 16 (green) to a black pin to turn the PSU on http://en.wikipedia.org/wiki/ATX#Power_supply).
James
On 4 July 2013 12:58, James Tunnicliffe james.tunnicliffe@linaro.org wrote:
Faster clocks also drink more power - now you are using slower clocks those boards will be stressing the PSU less.
That is true. Though, if memory serves me well, I think I was using one decent power supply and one cheap in the lab, and both Pandas were failing randomly.
Matt, if you could have a look at my rack shelf (it's the top one with the chromebook in it), there should be a power supply with a velcro on it. I only used the cheap one on my second Panda, because that was the one I was using to set it up on my desk.
Furthermore, mine were not the only Pandas failing when running toolchain testing and benchmarking...
There are plenty of other components on the board, any of which could be causing the problem, including the peripherals that you plugged in.
Just network and a USB thumb drive. Shouldn't be problematic.
I don't have a PSU at home nor a decent power supply here, so I can't perform this test, but I'm not really sure it will make a difference based on what happened in the lab rack.
--renato
On 4 July 2013 12:27, Renato Golin renato.golin@linaro.org wrote:
On 3 July 2013 18:33, Richard Earnshaw rearnsha@arm.com wrote:
keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get stability. If you don't, then it isn't a heating problem.
It might be a bit too soon, but I just got a few 7h builds out of the boards at 920MHz without a single glitch, whereas before, they wouldn't run for more than 4h in a row. Both boards have been running non-stop since 8pm yesterday.
Yesterday I turned one of the boards back to 1.2GHz (3pm), and it died during the night (2am). The 920MHz is still working. The room temperature didn't go over 26C (the thermometer is by the boards).
I do not believe it is possible to run the 4460 at 1.2GHz on full load without decent thermal management. I can see the frequency changing due to load on my log, so the kernel is doing "something", but I don't think it's actively slowing things down due to temperature concerns.
The heat sink extended the time the boards survived under load (based on lab data), but as Richard said, thermal conductivity has to be high along the whole path out, and the plastic casing does not help.
I've run the cpuburn at 920MHz and it runs indefinitely at around 70% max temperature (51C). When I set the maximum to 1.2GHz, it dies in 5 seconds.
Does anyone know how to turn on thermal management on the Linux kernel for the OMAP chips?
cheers, --renato
On 5 July 2013 08:56, Renato Golin renato.golin@linaro.org wrote:
Yesterday I turned one of the boards back to 1.2GHz (3pm), and it died during the night (2am). The 920MHz is still working. The room temperature didn't go over 26C (the thermometer is by the boards).
Status update:
One of the boards failed at 920MHz @ 60% temperature levels. I blamed the power supply and switched to make sure the *other* board would fail as well at 920MHz, which it did. So I retired that power supply and am using a third one, all cheap (all I have here).
I'm not sure how it managed to run for two full days before, but it could be due to the temperature of the room that has increased these last two days and the power supply itself is overheating.
cheers, --renato
On Sat, Jul 06, 2013 at 09:39:13AM +0100, Renato Golin wrote:
On 5 July 2013 08:56, Renato Golin renato.golin@linaro.org wrote:
Yesterday I turned one of the boards back to 1.2GHz (3pm), and it died during the night (2am). The 920MHz is still working. The room temperature didn't go over 26C (the thermometer is by the boards).
Status update:
One of the boards failed at 920MHz @ 60% temperature levels. I blamed the power supply and switched to make sure the *other* board would fail as well at 920MHz, which it did. So I retired that power supply and am using a third one, all cheap (all I have here).
I know others have said this, but since you seem to not yet be convinced: the frequency is a red herring. If your PSU can't really keep up with the Panda, all bets are off. We early on in the LAVA lab figured out they only ran reliably on those massive (IIRC, 4A) bricks that Digikey sells.
On Sun, 7 Jul 2013 19:00:47 -0300 Christian Robottom Reis kiko@canonical.com wrote:
On Sat, Jul 06, 2013 at 09:39:13AM +0100, Renato Golin wrote:
On 5 July 2013 08:56, Renato Golin renato.golin@linaro.org wrote:
Yesterday I turned one of the boards back to 1.2GHz (3pm), and it died during the night (2am). The 920MHz is still working. The room temperature didn't go over 26C (the thermometer is by the boards).
Status update:
One of the boards failed at 920MHz @ 60% temperature levels. I blamed the power supply and switched to make sure the *other* board would fail as well at 920MHz, which it did. So I retired that power supply and am using a third one, all cheap (all I have here).
I know others have said this, but since you seem to not yet be convinced: the frequency is a red herring. If your PSU can't really keep up with the Panda, all bets are off. We early on in the LAVA lab figured out they only ran reliably on those massive (IIRC, 4A) bricks that Digikey sells.
The alleged PandaBoard compatibility with only a single model of a massive 4A power brick does not sound right. Especially considering that the PandaBoard is supposed to have a superior hardware design quality compared to the competitors:
http://www.pandaboard.org/pbirclogs/index.php?date=2012-10-16#T01:39:33
Earlier Renato Golin mentioned that "These are 5V and on my multimeter I got almost 6V". Looks like some really cheap poorly regulated junk PSU?
My power bricks cost me around 10-15 EUR each (maybe that's overpriced and I could find something several times cheaper from some Chinese vendors?). All of them seem to be working fine, powering various development boards with weeks/months of uptime, and occasionally running long, heavy compilation workloads (gcc, libreoffice, llvm, chromium, firefox, ...) without problems. The voltage is very close to 5V, at least when measured without load. I'm getting ~5.2V from the "worst" one and the others deviate much less.
And naturally, any PSU rated for just something like 1A will not work right, that's common sense. The modern multi-core ARM boards can easily consume a lot more than this under load. But at least 2.5A or 3A should be sufficient if you don't attach many power hungry USB peripherals.
On 8 July 2013 14:48, Siarhei Siamashka siarhei.siamashka@gmail.com wrote:
And naturally, any PSU rated for just something like 1A will not work right, that's common sense. The modern multi-core ARM boards can easily consume a lot more than this under load. But at least 2.5A or 3A should be sufficient if you don't attach many power hungry USB peripherals.
AFAICR, that PSU is 2.5A, but I could be wrong. I'll check when I get back home. But it is cheap, and it could itself overheat, too.
So, for the time being, I'll leave it on 920MHz and keep the bots running until we sort out the power/temp problem.
Anyway, I'd like to move away from Pandas to something with a bit more horse power.
cheers, --renato
On Wed, 3 Jul 2013 16:48:51 +0100 Renato Golin renato.golin@linaro.org wrote:
But more to the point, I don't want to be scaled down when hot, I want it never to get hot in the first place, so I can run at full 1.2GHz, 24 / 7. If the scheduler reduces the frequency to decrease the temperature, I'll be testing more commits per run AND my benchmarks will be skewed, depending on room temperature, which is the same as to say they're not benchmarks at all.
For getting reproducible benchmark results, you just need to ensure that thermal throttling never kicks in. If the kernel is compiled with cpufreq stats enabled, you can compare these stats before/after your benchmark to ensure that it spent all the time running at the same designated clock frequency.
If you get thermal throttling interfering, just keep reducing the CPU clock frequency until this problem disappears. Alternatively, you can possibly reduce the CPU core voltage, but this drives the hardware beyond normal operational limits. Or slightly increase the critical temperature in the thermal framework. However these tricks are only necessary if you need to publish something like "1.2GHz PandaBoard benchmark results" and the non-stock clock frequency is simply out of the question.
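One way to do the before/after comparison described above is to snapshot the cpufreq statistics file (typically .../cpufreq/stats/time_in_state, one "frequency time" pair per line, present only when the kernel is built with cpufreq stats) around the benchmark. The sketch below assumes that format; the function names are hypothetical.

```python
def parse_time_in_state(text):
    """Parse cpufreq 'time_in_state' text into {freq_khz: ticks}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[int(fields[0])] = int(fields[1])
    return stats


def time_off_target(before, after, target_khz):
    """Return ticks accumulated at frequencies other than target_khz.

    An empty result means the whole interval between the two snapshots
    was spent at the designated frequency.
    """
    off_target = {}
    for freq, ticks in after.items():
        delta = ticks - before.get(freq, 0)
        if freq != target_khz and delta > 0:
            off_target[freq] = delta
    return off_target
```

A benchmark run would be discarded whenever time_off_target() returns a non-empty dictionary.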
Anyway, I recommend that you start the hardware robustness tests with:
wget https://raw.github.com/ssvb/cpuburn/master/cpuburn-a9.S
arm-linux-gnueabihf-gcc -o cpuburn-a9 cpuburn-a9.S
Then run it to exercise a really heavy multi-threaded workload on the CPU:
./cpuburn-a9
It would also be a good idea to verify that all the CPU cores are fully loaded.
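Checking that every core is busy can be done by comparing two snapshots of /proc/stat taken a few seconds apart. The following is a sketch under that assumption, with hypothetical function names; idle and iowait time are counted as "not busy".

```python
def parse_cpu_times(stat_text):
    """Extract {core_name: (busy, total)} from the per-core lines of /proc/stat."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 user nice system idle iowait ...";
        # the aggregate "cpu" line and all other lines are skipped.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            # Only the first eight counters are summed, since guest time
            # is already included in user time.
            values = [int(value) for value in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values)
            times[fields[0]] = (total - idle, total)
    return times


def core_utilisation(before_text, after_text):
    """Return {core_name: busy fraction} between two /proc/stat snapshots."""
    before = parse_cpu_times(before_text)
    after = parse_cpu_times(after_text)
    result = {}
    for core, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(core, (0, 0))
        elapsed = total_after - total_before
        result[core] = (busy_after - busy_before) / elapsed if elapsed > 0 else 0.0
    return result
```

While cpuburn is running, every core should report a fraction close to 1.0; a noticeably lower value on one core would mean the load is not spread across all of them.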
On 4 July 2013 17:34, Siarhei Siamashka siarhei.siamashka@gmail.com wrote:
For getting reproducible benchmark results, you just need to ensure that thermal throttling never kicks in. If the kernel is compiled with cpufreq stats enabled, you can compare these stats before/after your benchmark to ensure that it spent all the time running at the same designated clock frequency.
I did that on my Chromebook, put it on power mode and I get pretty consistent build and benchmark times. It's an art to make sure the benchmark run-time is enough to give you statistically relevant results while not being too much to deal with overheating or scheduling issues, but that, as you say, can be "fixed" by running on lower frequencies, I don't mind about that.
What I really mind is to lower the frequency of our buildbots, the ones that should be building and testing under 20 minutes (like octo-core i7s) but take 3 hours to do so (on a dual Panda). While the comparison is in no way fair, reducing the freq. will only make it worse. Coming from a server farm culture, where noise, power and air-conditioning are always topped up and never too expensive, it's hard not to giggle when hearing that you should lower the frequency to get "expected results".
Yes, ARM devices were designed with the phone market in mind, but today they're a lot more than that, and if they're to get into the server space, they have to be consistent, even when cranked up all the way to 11.
Anyway, I recommend that you start the hardware robustness tests with:
wget https://raw.github.com/ssvb/cpuburn/master/cpuburn-a9.S
arm-linux-gnueabihf-gcc -o cpuburn-a9 cpuburn-a9.S
I'll do that and report on my findings.
Thanks for the overview, it was very educational. ;)
cheers, --renato
On Wed, 3 Jul 2013 15:59:47 +0100 Mans Rullgard mans.rullgard@linaro.org wrote:
On 3 July 2013 14:13, Renato Golin renato.golin@linaro.org wrote:
Hi Folks,
I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that will tell me the current CPU temperature and the allowed maximum, and when the bot passes 90%, it shuts itself off.
The problem is that I'm running with heat-sinks and the boards are on top of three fans, so there really isn't much more I can do to solve this problem.
I personally think this is a hardware problem, since everything is in the same die, CPU, GPU and RAM, and the physical dimensions of the chip are quite small. I remember when Intel started overheating (around 486DX66) and the die was huge (more heat dissipation), plus RAM and GPU were separate, and it still needed a hefty heat-sink.
It's true that gates are far smaller today, but it's not true that a dual core 1.3GHz + GPU + RAM will produce less heat on a small die than a 66MHz CPU on a huge die, so why anyone thinks it's a good idea to release a 1+GHz chip without *any* form of heat dissipation is beyond my comprehension.
Modern silicon processes are much more power-efficient than those of the 90s. For example, an old ~500MHz Alpha machine I have readily consumes 90W even when idle. A quad-core Intel i7 typically has a TDP of 130W at full load. That's orders of magnitude more gates clocked at 6x the frequency and still using only marginally more power.
BTW, the RAM is a separate chip mounted on top of the SoC.
Manufacturers only got away with it, so far, because people rarely use 100% of the CPU power for extended periods of time, because ARM devices end up as set-top boxes, mobile phones and tablets. However, even those devices will heat up when playing 2 h films or games, and they do have some form of heat sink.
An OMAP4460 will run at 1.2GHz indefinitely without overheating in reasonable ambient temperature. The higher frequencies are only meant to be used in conjunction with (software) thermal management to throttle back if temperature rises.
If you don't have thermal management in the kernel you're running, you need to clamp the clock at a safe value.
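As a rough illustration of clamping the clock, the Python sketch below picks the highest available frequency at or below a chosen limit and writes it to the cpufreq scaling_max_freq file. It assumes the standard cpufreq sysfs layout under /sys/devices/system/cpu/cpu0/cpufreq and a driver that provides scaling_available_frequencies. The function names are hypothetical, and writing the file needs root.

```python
# Assumed sysfs location of the cpufreq controls for CPU 0.
CPUFREQ_DIR = "/sys/devices/system/cpu/cpu0/cpufreq"


def pick_safe_frequency(available_khz, limit_khz):
    """Return the highest available frequency that does not exceed limit_khz."""
    candidates = [freq for freq in available_khz if freq <= limit_khz]
    if not candidates:
        raise ValueError("no available frequency at or below %d kHz" % limit_khz)
    return max(candidates)


def clamp_max_frequency(limit_khz, cpufreq_dir=CPUFREQ_DIR):
    """Cap the CPU clock by writing to scaling_max_freq (requires root)."""
    with open(cpufreq_dir + "/scaling_available_frequencies") as handle:
        available = [int(field) for field in handle.read().split()]
    safe = pick_safe_frequency(available, limit_khz)
    with open(cpufreq_dir + "/scaling_max_freq", "w") as handle:
        handle.write(str(safe))
    return safe
```

With the 1.2GHz figure mentioned above, clamp_max_frequency(1200000) would leave the governor free to scale below that value but never above it.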
By the way, power consumption is not constant and heavily depends on what the CPU is actually doing. And 100% CPU load in one application does not mean that it would consume the same amount of power as 100% CPU load in another application. With some targeted "optimisations" it is possible to boost power consumption roughly by a factor of 1.5x compared to most heavy workloads in real applications. I have a collection of ARM cpuburn programs, empirically tuned for different microarchitectures (which means that they still can be possibly "improved"):
https://github.com/ssvb/cpuburn
It is possible that Cortex-A15 would show a similar ~1.5x factor for the power consumption boost if somebody were to tune cpuburn for it. But I'm a bit reluctant to dismantle my ARM Chromebook to hook a multimeter there (developer boards with no batteries and with barrel power connectors are much easier to deal with).
Some time ago, I tossed my Cortex-A9 cpuburn to the ODROID-X people. And coincidentally they quickly got the thermal framework properly integrated into their kernels and also started to offer optional active coolers to their customers :-)
Now if you also consider that SoCs usually have a lot more than just the CPU cores, the peak power consumption can be really high. Designing the cooling system so that it is able to handle the peak power consumption is a bit of an overkill. It is going to be expensive and/or bulky. And just restricting the CPU clock frequency so that the power consumption never exceeds a certain threshold, you are going to end up clocking the CPU at a really low speed. In my opinion, the right solution for modern ARM SoCs is just to always ensure proper throttling support (both in the hardware and in the software). ARM can even call it "turbo-boost", "turbo-core" or use some other marketing buzzword ;-)
On 4 July 2013 17:13, Siarhei Siamashka siarhei.siamashka@gmail.com wrote:
By the way, power consumption is not constant and heavily depends on what the CPU is actually doing. And 100% CPU load in one application does not mean that it would consume the same amount of power as 100% CPU load in another application.
This is really interesting, I had not considered it until now. If I understood correctly, this has to do with what/how many paths are taken inside the cores (CPU, GPU), or how much data is passing between mem/cache/registers, etc.
For the toolchain, there isn't much floating point going on, but if your compiler was itself built with auto-vectorization, you'll probably be using NEON, and there will be a lot of data movement, too, so I'm guessing compilers can stretch the CPU quite a lot overall. And since building a large project (like GCC or LLVM) takes several hours with very little happening outside the CPU, there isn't much time for the CPU to cool down between compilation jobs.
Some time ago, I tossed my Cortex-A9 cpuburn to the ODROID-X people. And coincidentally they quickly got the thermal framework properly integrated into their kernels and also started to offer optional active coolers to their customers :-)
Hahahaha! Yes, that's what I'm talking about. I don't think anyone did that with Pandas or Arndales, and somebody really should.
In my opinion, the right solution for modern ARM SoCs is just to always ensure proper throttling support (both in the hardware and in the software). ARM can even call it "turbo-boost", "turbo-core" or use some other marketing buzzword ;-)
Absolutely! Though, while throttling is the way to go, it might be simpler to wait for it with a decent cooling solution than with a lower frequency. The ODROID folks seem to have understood that pretty well.
It would be a lot easier to convince hardware vendors and cluster builders to buy huge active coolers than to convince them to lower the CPU frequency. The former shows a failure in software support, but the latter shows a failure in system design...
cheers, --renato
On 4 July 2013 18:10, Renato Golin renato.golin@linaro.org wrote:
On 4 July 2013 17:13, Siarhei Siamashka siarhei.siamashka@gmail.com wrote:
By the way, power consumption is not constant and heavily depends on what the CPU is actually doing. And 100% CPU load in one application does not mean that it would consume the same amount of power as 100% CPU load in another application.
This is really interesting, I had not considered it until now. If I understood correctly, this has to do with what/how many paths are taken inside the cores (CPU, GPU), or how much data is passing between mem/cache/registers, etc.
Modern CPU designs can even clock-gate partial pipelines when not in use. Typical code doesn't even use the multiply pipeline most of the time, so it will spend a lot of time gated. A carefully crafted piece of code, like Siarhei's, maintains the maximum sustained issue rate for a long time, and mixes instructions such that most of the pipelines are active most of the time. This makes the power consumption go up significantly.
It would be a lot easier to convince hardware vendors and cluster builders to buy huge active coolers, than convince them to lower the CPU frequency.
Chips intended for compute clusters will no doubt be possible to cool sufficiently to run at full speed all the time. Designing chips for different markets involves different sets of tradeoffs, and you're seeing the result of that.
On 4 July 2013 19:15, Mans Rullgard mans.rullgard@linaro.org wrote:
Chips intended for compute clusters will no doubt be possible to cool sufficiently to run at full speed all the time. Designing chips for different markets involves different sets of tradeoffs, and you're seeing the result of that.
Yes, this is what I'm trying to get at. For toolchain testing we need a machine that doesn't give up under high constant load for really long periods (months/years). This is slightly lighter than the kind of load that you'll have when using servers like Calxeda, for uses like Facebook's. On mobile platforms, the diversity of uses is really reduced, so it's ok to make several compromises on the chip/SoC/system design to save on costs. But when ARM cores hit the desktop/server market, these assumptions will stop being valid.
This video of Linus Torvalds on why Linux hasn't dominated the desktop market yet (and may never do so) is relevant:
http://www.youtube.com/watch?v=ZPUk1yNVeEI
Basically, the usage patterns are so disparate between users, or even groups of users, that it's hard for any single Linux distribution / vendor to focus on all of them. ARM is similar, in that ARM itself can't focus on servers or desktops only, but if the designs allow for modularity (I believe they do), partners could (should) build SoCs focused on different markets, vendors (like HP, Dell) could put together production systems, etc.
A move into desktops by either ARM or Linux would need a level of coordination between competitors that is probably not possible in today's market. (please, somebody tell me I'm wrong...)
cheers, --renato
On 3 July 2013 14:13, Renato Golin renato.golin@linaro.org wrote:
Hi Folks,
I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that will tell me the current CPU temperature and the allowed maximum, and when the bot passes 90%, it shuts itself off.
It may also be worth examining your power supplies and see if they are providing enough current to run the chip this hot reliably. A bench supply could eliminate this possibility conclusively.
-- Will Newton Toolchain Working Group, Linaro
On 3 July 2013 23:01, Will Newton will.newton@linaro.org wrote:
It may also be worth examining your power supplies and see if they are providing enough current to run the chip this hot reliably. A bench supply could eliminate this possibility conclusively.
They're cheap... *very* cheap... They're not the ones Linaro uses in the lab most of the time, but are the ones Linaro has loads of in the "power supply" drawer, and the ones that websites show you as "PandaBoard power supply".
Not this one:
http://www.digikey.com/product-detail/en/PSAC30U-050/993-1019-ND/2384432?cur...
This one:
http://www.amazon.co.uk/Pandaboard-Board-replacement-supply-adaptor/dp/B0087...
The difference in price tells you a lot... ;)
This was my conclusion when my Panda at home, on idle, was locking up. It wasn't turning off every time: sometimes it'd just lock up with one LED constantly on and the other constantly off, sometimes it'd shut down completely, and sometimes the screen would freeze, but it'd still be "running". With many other appliances connected to the same socket (TV, Sky, PS3, printer, etc), the spikes could be causing trouble.
The boards now have run overnight at 920MHz without a glitch, though they are understandably 50% slower. I'll see how they behave during today, and if they don't fail, I'll conclude that it was, indeed, the frequency, not the power supply.
cheers, --renato
On 4 July 2013 09:44, Renato Golin renato.golin@linaro.org wrote:
On 3 July 2013 23:01, Will Newton will.newton@linaro.org wrote:
It may also be worth examining your power supplies and see if they are providing enough current to run the chip this hot reliably. A bench supply could eliminate this possibility conclusively.
They're cheap... *very* cheap... They're not the ones Linaro uses in the lab most of the time, but are the ones Linaro has loads of in the "power supply" drawer, and the ones that websites show you as "PandaBoard power supply".
Not this one:
http://www.digikey.com/product-detail/en/PSAC30U-050/993-1019-ND/2384432?cur...
This one:
http://www.amazon.co.uk/Pandaboard-Board-replacement-supply-adaptor/dp/B0087...
What is the output current of this PSU? I tried running pandaboard with 2.5A PSU and it didn't even start. 3A seems to be the minimum.
milosz
linaro-validation mailing list linaro-validation@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-validation
I've just had a quick glance at the rack in LAVA Lab with the pandas in it, and it seems like we are using the 5V 4A brick style power supplies, not the cheap ones.
Matt
On 4 July 2013 10:01, Matt Hart matthew.hart@linaro.org wrote:
I've just had a quick glance at the rack in LAVA Lab with the pandas in it, and it seems like we are using the 5V 4A brick style power supplies, not the cheap ones.
I know, but somewhere in the lab there's a box full of the cheap ones, and these are the ones I used for my buildbots, and the ones I bought for me at home.
--renato
On 4 July 2013 10:00, Milosz Wasilewski milosz.wasilewski@linaro.org wrote:
What is the output current of this PSU? I tried running pandaboard with 2.5A PSU and it didn't even start. 3A seems to be the minimum.
These are rated at 5V, and on my multimeter I measured almost 6V, but what matters is not just the voltage but also a constant supply of current.
Since this supply is very cheap, it doesn't have a way of ensuring constant current when the current is being temporarily diverted to another socket because of peak usage from another device. I only have a small Atom server and my laptop, so there isn't much that could cause any substantial lack of current.
cheers, --renato
On 4 July 2013 10:08, Renato Golin renato.golin@linaro.org wrote:
On 4 July 2013 10:00, Milosz Wasilewski milosz.wasilewski@linaro.org wrote:
What is the output current of this PSU? I tried running pandaboard with 2.5A PSU and it didn't even start. 3A seems to be the minimum.
These are rated at 5V, and on my multimeter I measured almost 6V, but what matters is not just the voltage but also a constant supply of current.
Since this supply is very cheap, it doesn't have a way of ensuring constant current when the current is being temporarily diverted to another socket because of peak usage from another device.
That is not how electricity works.
On 4 July 2013 11:29, Mans Rullgard mans.rullgard@linaro.org wrote:
That is not how electricity works.
I may not have made myself clear, I suppose... We can digress about electricity at Connect.
cheers, --renato
Folks,
I had my final round of tests and I can say that there is no final conclusion on why they fail, but they did fail under every scenario I could try them on.
I've tested 3 identical boards (Panda-ES RevB2) with 5 different power supplies. Even at 920MHz, with decent power supplies (the high-quality 5V/4A ones used in the lab), they fail at 70% of their target temperatures, at least as of the last measurement (<1min before failing). So, unless they overheat in less than a minute, for no apparent reason, and get hot enough for the plastic case to become a nuisance to heat transfer, they're not really failing because of heat. The power supplies also stay very cool, so I doubt they're at fault.
There is absolutely nothing in the logs: no kernel panic, no error message, nothing. Since there is no indication that lowering the frequency to 700MHz will make any difference (the heat issue was indeed very likely a red herring), I'm basically giving up on the Pandas. They were either not meant to run for long periods at full capacity, or our kernel (Linaro 3.5.0-213-omap4) is not up to the task (which is worrying). But since I'm not a kernel engineer, there isn't much I can do from where I stand.
If anyone wants to continue the investigation at the kernel level, I can help set up the boards, but now I need to re-focus on more pressing issues.
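To narrow the "<1min before failing" gap, the percentage-of-target reading could be logged every few seconds. The Python sketch below is not the monitor used for these tests; it is a minimal illustration that assumes the generic Linux thermal sysfs files thermal_zone0/temp and thermal_zone0/trip_point_0_temp (both in millidegrees Celsius). The OMAP4 kernel discussed here may expose the sensor elsewhere, so the paths, the function names and the 5-second interval are all assumptions.

```python
import time

# Assumed thermal zone; the real path depends on the kernel and board.
ZONE = "/sys/class/thermal/thermal_zone0"


def read_millidegrees(path):
    """Read an integer temperature in millidegrees Celsius from a sysfs file."""
    with open(path) as handle:
        return int(handle.read().strip())


def percent_of_limit(current_mc, limit_mc):
    """Current temperature as a percentage of the limit (both in millidegrees)."""
    if limit_mc <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * current_mc / limit_mc


def monitor(interval_s=5, samples=None, zone=ZONE):
    """Print a timestamped reading every interval_s seconds.

    With samples=None the loop runs until interrupted.
    """
    count = 0
    while samples is None or count < samples:
        current = read_millidegrees(zone + "/temp")
        limit = read_millidegrees(zone + "/trip_point_0_temp")
        stamp = time.strftime("%H:%M:%S")
        # flush so the last line reaches the log even if the board locks up
        print("%s %.1f%% of limit" % (stamp, percent_of_limit(current, limit)),
              flush=True)
        time.sleep(interval_s)
        count += 1
```

Redirecting the output to a file on another machine (for example over ssh) would keep the last reading available after a lock-up.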
cheers, --renato
linaro-toolchain@lists.linaro.org