On Wed, 3 Jul 2013 15:59:47 +0100 Mans Rullgard mans.rullgard@linaro.org wrote:
On 3 July 2013 14:13, Renato Golin renato.golin@linaro.org wrote:
Hi Folks,
I'm running two buildbots here at home and am getting consistent failures from the Pandas because of overheating. I've set up a monitor that tells me the current CPU temperature and the allowed maximum, and when the temperature passes 90% of that maximum, the bot shuts itself off.
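For reference, a minimal sketch of that kind of watchdog on Linux. This is illustrative, not the actual monitor described above; it assumes the standard sysfs thermal interface (temperatures in millidegrees Celsius), and the specific paths and poll interval are assumptions:

```python
#!/usr/bin/env python3
"""Illustrative CPU-temperature watchdog: shut down when the temperature
passes 90% of the allowed maximum. Assumes the standard Linux sysfs
thermal interface, which reports millidegrees Celsius."""
import subprocess
import time

# Typical sysfs locations (assumed; zone numbering varies per board).
TEMP_PATH = "/sys/class/thermal/thermal_zone0/temp"              # current temp
TRIP_PATH = "/sys/class/thermal/thermal_zone0/trip_point_0_temp" # allowed max


def read_millicelsius(path):
    with open(path) as f:
        return int(f.read().strip())


def over_threshold(temp_mc, max_mc, fraction=0.9):
    """True when the temperature reaches the given fraction of the maximum."""
    return temp_mc >= max_mc * fraction


def watchdog(poll_seconds=5):
    max_mc = read_millicelsius(TRIP_PATH)
    while True:
        if over_threshold(read_millicelsius(TEMP_PATH), max_mc):
            # Power off before the board cooks itself.
            subprocess.run(["shutdown", "-h", "now"])
            return
        time.sleep(poll_seconds)
```

The decision logic is kept in `over_threshold` so the hardware-independent part can be sanity-checked without a board attached.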
The problem is that I'm running with heat-sinks and the boards are on top of three fans, so there really isn't much more I can do to solve this problem.
I personally think this is a hardware problem, since everything is on the same die: CPU, GPU and RAM, and the physical dimensions of the chip are quite small. I remember when Intel chips started overheating (around the 486DX2-66) and the die was huge (more heat dissipation), plus RAM and GPU were separate, and it still needed a hefty heat-sink.
It's true that gates are far smaller today, but it's not true that a dual-core 1.3GHz CPU + GPU + RAM will produce less heat on a small die than a 66MHz CPU on a huge die, so why anyone thinks it's a good idea to release a 1+GHz chip without *any* form of heat dissipation is beyond my comprehension.
Modern silicon processes are much more power-efficient than those of the 90s. For example, an old ~500MHz Alpha machine I have readily consumes 90W even when idle. A quad-core Intel i7 typically has a TDP of 130W at full load. That's orders of magnitude more gates clocked at 6x the frequency and still using only marginally more power.
BTW, the RAM is a separate chip mounted on top of the SoC (package-on-package).
Manufacturers have only got away with it so far because people rarely use 100% of the CPU power for extended periods of time: ARM devices end up as set-top boxes, mobile phones and tablets. However, even those devices will heat up when playing two-hour films or games, and they do have some form of heat sink.
An OMAP4460 will run at 1.2GHz indefinitely without overheating in reasonable ambient temperature. The higher frequencies are only meant to be used in conjunction with (software) thermal management to throttle back if temperature rises.
If you don't have thermal management in the kernel you're running, you need to clamp the clock at a safe value.
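That clamp can be expressed through the cpufreq sysfs interface. A hedged sketch, assuming the usual layout (`scaling_available_frequencies` in kHz, `scaling_max_freq` writable as root); the 1.2 GHz safe value follows the OMAP4460 figure above, and the frequency list in the usage comment is illustrative:

```python
#!/usr/bin/env python3
"""Sketch: clamp the CPU clock to the highest advertised frequency that is
still thermally safe. Assumes the standard Linux cpufreq sysfs layout;
paths and the safe limit are illustrative, not board-verified."""

CPUFREQ = "/sys/devices/system/cpu/cpu0/cpufreq"
SAFE_KHZ = 1200000  # 1.2 GHz, per the OMAP4460 discussion above


def pick_safe_freq(available_khz, safe_khz):
    """Highest advertised frequency not exceeding the safe limit."""
    candidates = [f for f in available_khz if f <= safe_khz]
    if not candidates:
        raise ValueError("no advertised frequency at or below the safe limit")
    return max(candidates)


def clamp():
    with open(f"{CPUFREQ}/scaling_available_frequencies") as f:
        available = [int(tok) for tok in f.read().split()]
    # Writing scaling_max_freq caps all governors at the chosen value.
    with open(f"{CPUFREQ}/scaling_max_freq", "w") as f:
        f.write(str(pick_safe_freq(available, SAFE_KHZ)))
```

With a kernel that does have thermal management, this becomes unnecessary: the thermal framework lowers the cap dynamically instead of pinning it.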
By the way, power consumption is not constant and depends heavily on what the CPU is actually doing. 100% CPU load in one application does not mean it consumes the same amount of power as 100% CPU load in another. With some targeted "optimisations" it is possible to boost power consumption by roughly a factor of 1.5 compared to the heaviest workloads seen in real applications. I have a collection of ARM cpuburn programs, empirically tuned for different microarchitectures (which means they can quite possibly still be "improved"):
https://github.com/ssvb/cpuburn
It is possible that the Cortex-A15 would show a similar ~1.5x power consumption boost if somebody tuned cpuburn for it. But I'm a bit reluctant to dismantle my ARM Chromebook to hook up a multimeter (developer boards with no batteries and barrel power connectors are much easier to deal with).
Some time ago, I tossed my Cortex-A9 cpuburn to the ODROID-X people. And coincidentally they quickly got the thermal framework properly integrated into their kernels and also started to offer optional active coolers to their customers :-)
Now if you also consider that SoCs usually contain a lot more than just the CPU cores, the peak power consumption can be really high. Designing the cooling system to handle that peak is overkill: it would be expensive and/or bulky. But if you instead restrict the CPU clock frequency so that power consumption can never exceed a safe threshold, you end up clocking the CPU at a really low speed. In my opinion, the right solution for modern ARM SoCs is to always ensure proper throttling support (both in hardware and in software). ARM can even call it "turbo-boost", "turbo-core" or some other marketing buzzword ;-)