On Mon, Aug 15, 2022 at 03:12:19PM -0700, John Hubbard wrote:
On 8/15/22 07:41, Mel Gorman wrote:
For the rest, I didn't see obvious recovery paths that would allow the system to run predictably. Any of them firing will have unpredictable consequences (e.g. move_queued_task firing would be fun if it was a per-cpu kthread). Depending on which warning triggers, the remaining life of the system may be very short but maybe long enough to be logged even if system locks up shortly afterwards.
Hi Mel,
Are you basically saying, "WARN_ON*() seems acceptable in mm/, because we can at least get the problem logged before it locks up, probably"?
I don't consider this to be a subsystem-specific guideline and I am certainly not suggesting that mm/ is the gold standard but my understanding was "If a warning is recoverable in a sensible fashion then do so". A warning is always bad, but it may be possible for the system to continue long enough for the warning to be logged or an admin to schedule a reboot at a controlled time. I think Ingo is doing the right thing with this patch to avoid an unnecessary panic but for at least some of them, a partial recovery is possible so why not do it seeing as it's been changed anyway?
The outcome of a warning is very variable, it might be a coding error where state is unexpected but it can be reset to a known state, maybe some data is leaked that if it persists we eventually OOM, maybe a particular feature will no longer work as expected (e.g. load balancing doesn't balance load due to a warning) or maybe the whole system will fail completely in the near future, possibly before an error can be logged. In the scheduler side, a warning may mean that a task never runs again and another means that a task is in the wrong scheduling class which means ... no idea, but it's not good if it's a deadline task that gets manipulated by the normal scheduling class instead.
For mm/, take three examples
1. compaction.c warning. PFN ranges are unexpected, reset them. Compaction is less effective and opportunities were missed but the system will not blow up. Maybe at worst a hugetlb allocation would fail when it might otherwise have succeeded or maybe SLUB degrades because it can't use a high-order page for a cache that could reside in an order-0 page instead. It varies.
2. Warning in in filemap_unaccount_folio -- possible loss of unwritten data on normal filesystems. Bad, might cause inconsistency for state but maybe it'll be dirtied again and it'll recover eventually. It's certainly not good if that data got lost forever.
3. The warning in isolate_movable_page is bad, page flags state is either corrupted by a driver or maybe there was a bit flip error but a page may be migrated unexpectedly leading to Who Knows What, corruption of a page that gets freed and reused by another process? It's much worse than compaction restoring its internal state of where it is scanning but maybe not a complete disaster.
Or are you implying that we should instead introduce and use some new PANIC_ON() or VM_PANIC_ON() set of macros that would allow proper "log then reboot" behavior?
No, not as a general rule. VM_PANIC_ON would suggest it's under DEBUG_VM which is not always set and I do not see how either PANIC_ON or VM_PANIC_ON could be coded in such a way that it always gets logged because it depends on the context the bad condition occurred.
I also don't think the kernel could conditionally warn or panic for a specific instance because if an admin wants a panic on a warning (can be very useful for a crash dump), then panic_on_warn is available.
Either way, PANIC_ON or VM_PANIC_ON is outside the scope of this patch.
My understanding was that a panic() should only be used in the case where the kernel is screwed and if tries to continue, the outcome will be much worse or it's early in boot and the error condition means it'll be impossible to boot anyway.