On 3 July 2013 14:13, Renato Golin renato.golin@linaro.org wrote:
Hi Folks,
I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that reads the current CPU temperature and the allowed maximum, and when the bot passes 90% of that maximum, it shuts itself off.
The problem is that I'm running with heat-sinks and the boards are on top of three fans, so there really isn't much more I can do to solve this problem.
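The monitor described above could look roughly like the sketch below. It is not the actual script: the file paths are passed in by the caller because the location of the temperature and limit files depends on the kernel and board (on kernels with the generic thermal framework, something like /sys/class/thermal/thermal_zone0/temp is a common place to look, but that is an assumption here). The names read_millidegrees and should_shut_down are made up for illustration.

```python
from pathlib import Path


def read_millidegrees(path):
    """Read an integer temperature (millidegrees Celsius) from a sysfs-style file."""
    return int(Path(path).read_text().strip())


def should_shut_down(temp_path, max_path, percent=90):
    """Return True once the current temperature has passed `percent` of the maximum.

    Both paths are hypothetical inputs; point them at whatever files the
    running kernel exposes for the current and maximum temperature.
    Integer arithmetic avoids floating-point surprises at the boundary.
    """
    current = read_millidegrees(temp_path)
    maximum = read_millidegrees(max_path)
    return current * 100 > maximum * percent
```

A cron job or a small loop can call should_shut_down() every few seconds and halt the board when it returns True.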
I personally think this is a hardware problem, since everything is on the same die, CPU, GPU and RAM, and the physical dimensions of the chip are quite small. I remember when Intel started overheating (around 486DX66) and the die was huge (so more heat dissipation), plus RAM and GPU were separate, and it still needed a hefty heat-sink.
It's true that gates are far smaller today, but it's not true that a dual core 1.3GHz + GPU + RAM will produce less heat on a small die than a 66MHz CPU on a huge die, so why anyone would think it's a good idea to release a 1+GHz chip without *any* form of heat dissipation is beyond my comprehension.
Modern silicon processes are much more power-efficient than those of the 90s. For example, an old ~500MHz Alpha machine I have readily consumes 90W even when idle. A quad-core Intel i7 typically has a TDP of 130W at full load. That's orders of magnitude more gates, clocked at roughly 6x the frequency, while using only marginally more power.
BTW, the RAM is a separate chip mounted on top of the SoC.
Manufacturers have only got away with it so far because ARM devices end up as set-top boxes, mobile phones and tablets, where people rarely use 100% of the CPU power for extended periods of time. However, even those devices will heat up when playing two-hour films or games, and they do have some form of heat sink.
An OMAP4460 will run at 1.2GHz indefinitely without overheating in reasonable ambient temperature. The higher frequencies are only meant to be used in conjunction with (software) thermal management to throttle back if temperature rises.
If you don't have thermal management in the kernel you're running, you need to clamp the clock at a safe value.
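One way to clamp the clock from userspace is through the cpufreq sysfs files. The sketch below assumes the kernel's cpufreq driver exposes scaling_available_frequencies and scaling_max_freq (values in kHz) in the policy directory, typically /sys/devices/system/cpu/cpu0/cpufreq; whether it does depends on the kernel you are running. The function names pick_clamp and clamp_max_freq are made up for illustration.

```python
from pathlib import Path


def pick_clamp(available_khz, limit_khz):
    """Pick the highest available frequency that does not exceed the limit."""
    candidates = [f for f in available_khz if f <= limit_khz]
    if not candidates:
        raise ValueError("no available frequency at or below the limit")
    return max(candidates)


def clamp_max_freq(cpufreq_dir, limit_khz):
    """Write a safe ceiling to scaling_max_freq and return the value written.

    cpufreq_dir is assumed to contain the standard cpufreq files
    scaling_available_frequencies and scaling_max_freq (both in kHz).
    """
    d = Path(cpufreq_dir)
    text = (d / "scaling_available_frequencies").read_text()
    available = [int(f) for f in text.split()]
    chosen = pick_clamp(available, limit_khz)
    (d / "scaling_max_freq").write_text(f"{chosen}\n")
    return chosen
```

Run as root, clamp_max_freq("/sys/devices/system/cpu/cpu0/cpufreq", 1200000) would cap the clock at 1.2GHz, or at the next lower step if the driver does not offer exactly that frequency. The setting does not survive a reboot, so it belongs in a boot script.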