I believe that in the LAVA lab there are a few Pandas with USB keys that are used for builds to try to overcome some reliability problems. I don't know whether that was a temperature problem or something else. With any luck someone who knows more about that issue can speak up and share what they found.
You could also try running "stress --cpu 4 --vm 2" and see if any errors show. I find that on my desktop, running twice as many CPU stress threads as I have CPUs is about right to eat all available resources. That stresses only RAM and CPU, not disk I/O, which should help pinpoint the problem. There are plenty of other options (http://www.hecticgeek.com/2012/11/stress-test-your-ubuntu-computer-with-stre...).
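If it helps, the stress run suggested above could be scripted roughly as follows. This is only a sketch: the helper names (build_stress_command, run_stress) and the 600-second timeout are my own assumptions, and it presumes the stress tool is installed and on the PATH. It uses the rule of thumb of twice as many CPU workers as CPUs, plus two memory workers and no disk I/O workers.

```python
import os
import shutil
import subprocess


def build_stress_command(cpus, vm_workers=2, timeout_s=600):
    """Build a `stress` command line with twice as many CPU workers as CPUs."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return [
        "stress",
        "--cpu", str(2 * cpus),        # 2x the CPU count (rule of thumb above)
        "--vm", str(vm_workers),       # memory workers only, no disk I/O workers
        "--timeout", str(timeout_s),   # seconds; 600 is an arbitrary assumption
    ]


def run_stress(timeout_s=600):
    """Run stress and return its exit status; non-zero suggests a failure."""
    if shutil.which("stress") is None:
        raise RuntimeError("stress is not installed")
    cmd = build_stress_command(os.cpu_count() or 1, timeout_s=timeout_s)
    return subprocess.run(cmd).returncode
```

On a dual-core Panda, build_stress_command(2) gives the same "--cpu 4 --vm 2" invocation as above, with a timeout added so the run ends on its own.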
Is running at 100% of the thermal limit really an issue? Isn't the point that it is the limit, which itself should have some safety margin built in? I don't know offhand whether the OMAP 4 SoCs incorporate hardware frequency limiting or whether it is entirely software, in which case the kernel frequency governor should (at a guess) be throttling back.
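One way to see whether the governor is throttling back is to compare the current clock with the hardware maximum while the board is under full load. The sketch below reads the generic Linux cpufreq sysfs files; whether that directory is populated on a given Panda kernel is an assumption, and the function names are made up for illustration.

```python
from pathlib import Path

# Generic Linux cpufreq sysfs directory for CPU 0. Whether it is populated
# on a given Panda kernel is an assumption.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def read_cpufreq(cpufreq_dir=CPUFREQ_DIR):
    """Return the governor name and the current/maximum frequency in kHz."""
    d = Path(cpufreq_dir)
    return {
        "governor": (d / "scaling_governor").read_text().strip(),
        "cur_khz": int((d / "scaling_cur_freq").read_text()),
        "max_khz": int((d / "cpuinfo_max_freq").read_text()),
    }


def is_throttled(info):
    """True when the CPU is currently clocked below its hardware maximum."""
    return info["cur_khz"] < info["max_khz"]
```

A lower clock only hints at thermal throttling if the CPU is fully loaded at the time, since a governor such as ondemand also drops the frequency when the board is idle.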
I did have a Panda give up on me about a year ago. It wasn't being worked hard, but it refused to get through a boot most of the time (it powered on and got part way through booting). Those boards aren't designed for high reliability, and it may be that you just need to get a couple of replacements.
James
On 3 July 2013 14:13, Renato Golin renato.golin@linaro.org wrote:
Hi Folks,
I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that will tell me the current CPU temperature and the allowed maximum, and when the bot passes 90%, it shuts itself off.
The problem is that I'm running with heat-sinks and the boards are on top of three fans, so there really isn't much more I can do to solve this problem.
I personally think this is a hardware problem, since everything is on the same die, CPU, GPU and RAM, and the physical dimensions of the chip are quite small. I remember when Intel chips started overheating (around the 486DX66): the die was huge (more heat dissipation), RAM and GPU were separate, and it still needed a hefty heat-sink.
It's true that gates are far smaller today, but it's not true that a dual-core 1.3GHz CPU + GPU + RAM will produce less heat on a small die than a 66MHz CPU on a huge die, so why anyone thinks it's a good idea to release a 1+GHz chip without *any* form of heat dissipation is beyond my comprehension.
Manufacturers have only got away with it so far because people rarely use 100% of the CPU power for extended periods of time, since ARM devices mostly end up as set-top boxes, mobile phones and tablets. However, even those devices heat up when playing two-hour films or games, and they do have some form of heat sink.
We, at the toolchain group, make things worse by using 100% CPU, 24/7, something that Panda boards or Arndales were not designed to do. However, with ARM moving into the server space, their designs will have to be re-thought, and what better place than Linaro for making sure we get it right?
For the time being, I believe we *must* have air conditioning in the Lab all the time, and we *must* have heat-sinks on every board, and we *must* monitor the CPU temperature of the boards, at least until we're comfortable that they're not failing all the time.
Can we make a temperature monitor (like the one attached) a default feature on Linaro Ubuntu distributions? We could dump that info to the syslog/dmesg whenever it crosses a threshold of (say) 75%, and report more often when it crosses 95%, possibly dumping the process(es) that are consuming the most CPU at the time, to enable post-mortem debugging.
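To make the proposal concrete, here is a rough sketch of such a monitor. It is not the attached script. The sysfs paths are the generic Linux thermal-zone files and are an assumption (the Panda kernel may expose its sensor elsewhere); the function names and polling intervals are also made up for illustration.

```python
import subprocess
import syslog
import time
from pathlib import Path

# Assumed sysfs locations (generic Linux thermal zone, millidegrees Celsius);
# the Panda kernel may expose the sensor somewhere else.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
MAX_PATH = Path("/sys/class/thermal/thermal_zone0/trip_point_0_temp")

WARN_FRACTION = 0.75   # start logging at this fraction of the allowed maximum
ALERT_FRACTION = 0.95  # log more often and include the busiest processes


def read_millidegrees(path):
    """Read an integer temperature in millidegrees Celsius from a sysfs file."""
    return int(Path(path).read_text())


def classify(temp, max_temp):
    """Return 'ok', 'warn' or 'alert' for a temperature and its allowed maximum."""
    fraction = temp / max_temp
    if fraction >= ALERT_FRACTION:
        return "alert"
    if fraction >= WARN_FRACTION:
        return "warn"
    return "ok"


def top_cpu_processes(count=3):
    """Return the `count` processes using the most CPU, as ps output lines."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm", "--sort=-pcpu"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return out[1:count + 1]  # skip the header line


def monitor(warn_interval_s=60, alert_interval_s=10):
    """Poll forever, writing to syslog once a threshold is crossed."""
    while True:
        temp = read_millidegrees(TEMP_PATH)
        max_temp = read_millidegrees(MAX_PATH)
        level = classify(temp, max_temp)
        if level != "ok":
            msg = f"CPU temperature {temp / 1000:.1f} C of {max_temp / 1000:.1f} C allowed"
            if level == "alert":
                procs = " | ".join(line.strip() for line in top_cpu_processes())
                msg += "; top CPU: " + procs
            syslog.syslog(syslog.LOG_WARNING, msg)
        # Poll faster once the alert threshold has been crossed.
        time.sleep(alert_interval_s if level == "alert" else warn_interval_s)
```

Calling monitor() from an init script would keep it running in the background. The thresholds are kept as plain fractions so they are easy to tune per board.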
cheers, --renato
As a side note, the quad-A9 ODroid does ship with a massive heat-sink, which also serves as a fancy case. Quite clever, really.
linaro-validation mailing list linaro-validation@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-validation