Folks,
I've finished my final round of tests. There is no final conclusion on why the boards fail, but they did fail under every scenario I could try them on.
I've tested 3 identical boards (Panda-ES RevB2) with 5 different power supplies. Even at 920MHz, with decent power supplies (high-quality 5V/4A, the ones used in the lab), they fail at 70% of their target temperatures, at least as of the last measurement (<1min before failing). So, unless they overheat in less than a minute, for no apparent reason, and get hot enough for the plastic case to hinder heat transfer, they're not really failing because of heat. The power supplies also stay very cool, so I doubt they're at fault.
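For anyone who wants to rule out the "overheats in under a minute" case, a tighter sampling loop would narrow the gap between the last reading and the failure. Below is a minimal sketch, not what I actually ran. It assumes the kernel exposes the sensor through the generic sysfs thermal interface (a file holding millidegrees Celsius); the sensor path and log path in the usage comment are assumptions and may differ on the Linaro omap4 kernel. Each sample is fsync'ed so the last reading has a chance of surviving a hard hang.

```python
import os
import time


def read_temp_millic(path):
    """Read a sysfs-style temperature file (value in millidegrees Celsius)."""
    with open(path) as f:
        return int(f.read().strip())


def log_sample(log_file, temp_path):
    """Append one timestamped reading and force it out to disk."""
    temp_c = read_temp_millic(temp_path) / 1000.0
    log_file.write("%.1f %.3f\n" % (time.time(), temp_c))
    log_file.flush()
    # fsync so the sample is not lost in the page cache if the board hangs
    os.fsync(log_file.fileno())
    return temp_c


def run(temp_path, log_path, interval_s=1.0):
    """Sample forever at a fixed interval (hypothetical helper)."""
    with open(log_path, "a") as log_file:
        while True:
            log_sample(log_file, temp_path)
            time.sleep(interval_s)


# Example invocation; both paths are assumptions, adjust for the board:
# run("/sys/class/thermal/thermal_zone0/temp", "/var/log/panda-temp.log")
```

With a one-second interval, the last line in the log after a failure gives the temperature at most about a second before the board went down, which should be enough to confirm or discard a sudden thermal spike.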
There is absolutely nothing in the logs: no kernel panic, no error message, nothing. Since there is no indication that lowering the frequency to 700MHz will make any difference (the heat issue was indeed very likely a red herring), I'm basically giving up on the Pandas. They were either not meant to run for long periods at full capacity, or our kernel (Linaro 3.5.0-213-omap4) is not up to the task (which is worrying). But since I'm not a kernel engineer, there isn't much I can do from where I stand.
If anyone wants to continue the investigation at the kernel level, I can help set up the boards, but for now I need to re-focus on more pressing issues.
cheers,
--renato