On Tue, Aug 9, 2022 at 12:09 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> Since BUG_ON crashes the machine and Linus says that crashing the machine is bad, WARN_ON will also crash the machine if you set the panic_on_warn parameter, so it is also bad, thus we shouldn't use anything.
If you set 'panic_on_warn' you get to keep both pieces when something breaks.
The thing is, there are people who *do* want to stop immediately when something goes wrong in the kernel.
Anybody doing large-scale virtualization presumably has all the infrastructure to get debug info out of the virtual environment.
And people who run controlled loads in big server machine setups and have a MIS department to manage said machines typically also prefer for a machine to just crash over continuing.
So in those situations, a dead machine is still a dead machine, but you get the information out, and panic_on_warn is fine, because panic and reboot is fine.
And yes, that's actually a fairly common case. Things like syzkaller etc *want* to abort on the first warning, because that's kind of the point.
But while that kind of virtualized automation machinery is very very common, and is a big deal, it's by no means the only deal, and it's not the most important thing to the point where nothing else matters.
And if you are *not* in a farm, and if you are *not* using virtualization, a dead machine is literally a useless brick. Nobody has serial lines on individual machines any more. In most cases, the hardware literally doesn't even exist any more.
So in that situation, you really cannot afford to take the approach of "just kill the machine". If you are on a laptop and are doing power management code, you generally cannot do that in a virtual environment, and you already have enough problems with suspend and resume being hard to debug, without people also going "oh, let's just BUG_ON() and kill the machine".
Because the other side of that "we have a lot of machine farms doing automated testing" is that those machine farms do not generally find a lot of the exciting cases.
Almost every single merge window, I end up having to bisect and report an oops or a WARN_ON(), because I actually run on real hardware. And said problem was never seen in linux-next.
So we have two very different cases: the "virtual machine with good logging where a dead machine is fine" - use 'panic_on_warn'. And the actual real hardware with real drivers, running real loads by users.
Both are valid. But the second case means that BUG_ON() is basically _never_ valid.
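As a rough sketch of what that means in code: warn, fail the operation, and let the caller deal with the error instead of killing the machine. This is a userspace stand-in, not kernel source. The WARN_ON() below only mimics the real macro's behavior of evaluating to the condition, and `demo_dev`, `demo_dev_put` and `warn_count` are made-up names for illustration.

```c
#include <errno.h>
#include <stdio.h>

/* Counts warnings so the behavior can be checked; illustration only. */
static int warn_count;

/* Userspace stand-in for the kernel's WARN_ON(): report the condition
 * and return whether it was true, so the caller can bail out. */
static int warn_on_report(int cond, const char *expr, const char *file, int line)
{
	if (cond) {
		warn_count++;
		fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
	}
	return cond;
}
#define WARN_ON(cond) warn_on_report(!!(cond), #cond, __FILE__, __LINE__)

/* Hypothetical driver state, invented for this example. */
struct demo_dev {
	int refcount;
};

/* Where one might have written BUG_ON(dev->refcount <= 0),
 * warn and return an error instead. */
static int demo_dev_put(struct demo_dev *dev)
{
	if (WARN_ON(dev->refcount <= 0))
		return -EINVAL;	/* machine stays up, caller sees the failure */

	dev->refcount--;
	return 0;
}
```

On a box with 'panic_on_warn' set, the real WARN_ON() would still stop the machine at that point, which is what the farm people want. Everybody else gets a log entry and a machine that keeps running.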
Linus