Hi Folks,

I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that reports the current CPU temperature and the allowed maximum, and when a bot passes 90% of that maximum, it shuts itself off.

The problem is that I'm running with heat-sinks and the boards are on top of three fans, so there really isn't much more I can do to solve this problem.

I personally think this is a hardware problem, since everything is on the same die (CPU, GPU and RAM) and the physical dimensions of the chip are quite small. I remember when Intel chips started overheating (around the 486DX66): the die was huge (more heat dissipation), RAM and GPU were separate, and it still needed a hefty heat-sink.

It's true that gates are far smaller today, but it's not true that a dual-core 1.3GHz + GPU + RAM will produce less heat on a small die than a 66MHz CPU on a huge die, so why anyone thinks it's a good idea to release a 1+GHz chip without *any* form of heat dissipation is beyond my comprehension.

Manufacturers have only got away with it so far because ARM devices end up as set-top boxes, mobile phones and tablets, where people rarely use 100% of the CPU power for extended periods of time. However, even those devices will heat up when playing two-hour films or games, and they do have some form of heat-sink.

We, at the toolchain group, make things worse by using 100% CPU, 24/7, something that Panda boards or Arndales were not designed to do. However, with ARM moving into the server space, their designs will have to be re-thought, and what better place than Linaro for making sure we get it right?

For the time being, I believe we *must* have air conditioning in the Lab all the time, and we *must* have heat-sinks on every board, and we *must* monitor the CPU temperature of the boards, at least until we're comfortable that they're not failing all the time.

Can we make a temperature monitor (like the one attached) a default feature on Linaro Ubuntu distributions? We could dump that info to the syslog/dmesg whenever it crosses a (say) 75% threshold, and report more often when it crosses 95%, possibly dumping the process(es) that are consuming the most CPU at the time, to enable post-mortem debugging.
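
To make the threshold idea concrete, here is a rough sketch of what such a check could look like. This is not the attached monitor, just an illustration. It assumes the board exposes the generic Linux thermal sysfs files (thermal_zone0/temp and trip_point_0_temp, both in millidegrees Celsius), which may not be where a given Panda or Arndale kernel puts them, and the names classify, check_zone and ZONE are made up for the example.

```python
import os

# Thresholds suggested above, as a percentage of the allowed maximum.
WARN_PERCENT = 75
CRIT_PERCENT = 95

# Assumption: generic Linux thermal sysfs layout; the real path may differ per board.
ZONE = "/sys/class/thermal/thermal_zone0"


def read_millidegrees(path):
    """Read one integer value (millidegrees Celsius) from a sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def classify(current, maximum, warn=WARN_PERCENT, crit=CRIT_PERCENT):
    """Return 'ok', 'warning' or 'critical' for a temperature reading.

    Integer arithmetic is used so the comparison is exact at the thresholds.
    """
    if maximum <= 0:
        raise ValueError("maximum temperature must be positive")
    if current * 100 >= maximum * crit:
        return "critical"
    if current * 100 >= maximum * warn:
        return "warning"
    return "ok"


def check_zone(zone=ZONE):
    """Read one thermal zone and log to syslog if a threshold is crossed."""
    import syslog  # Unix-only module, imported here so classify() works anywhere

    current = read_millidegrees(os.path.join(zone, "temp"))
    maximum = read_millidegrees(os.path.join(zone, "trip_point_0_temp"))
    level = classify(current, maximum)
    if level != "ok":
        priority = syslog.LOG_CRIT if level == "critical" else syslog.LOG_WARNING
        syslog.syslog(priority, "CPU temperature %s: %d of %d millidegrees C"
                      % (level, current, maximum))
    return level
```

A cron job or a small loop could call check_zone() every minute or so, and more often once it returns "critical". Dumping the top CPU consumers at that point is left out of the sketch.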

cheers,
--renato

As a side note, the quad-A9 ODroid does ship with a massive heat-sink, which also serves as a fancy case. Quite clever, really.