On 10/12/2021 14:46, Thomas Hellström wrote:
On Fri, 2021-12-10 at 11:05 +0000, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
This effectively removes writeback which was added in 2d6692e642e7 ("drm/i915: Start writeback from the shrinker").
Digging through the history, it seems we have gone back and forth a couple of times on whether it would be safe. See for instance 5537252b6b6d ("drm/i915: Invalidate our pages under memory pressure"), where Hugh Dickins advised against it. I do not have enough expertise in the memory management area, so I am hoping for expert input here.
The reason for proposing removal is that there are reports from the field which indicate a system-wide deadlock (of a sort) implicating i915 doing writeback at shrinking time.
Signature is a hung task notifier kicking in and task traces such as:
It would be interesting to see what exactly the find_get_entry is blocked on. The other two tasks are blocked on the shrinker_rwsem which is held by i915. If it's indeed a deadlock with either of those two,
It may indeed be a livelock instead of a deadlock. I have received a newer trace and it does show kswapd in running state. But no progress in 120s and a dead machine sounded too suspicious to be caused by just a gaming workload, so I assumed a more serious issue than just severe memory pressure.
then the fix Chris is working on for an unrelated issue we discovered with shrinking would move the writeback call out from under the shrinker_rwsem and resolve this. But if i915 is in turn deadlocking with another process, and these two are just hanging waiting for the shrinker_rwsem, we would still have other issues.
Presumably this would involve an extra worker and tracking on a list or something?
Otherwise my main hope really was to get a verdict from memory management experts on pros & cons of doing writeback from the driver in any flavour.
Do you by any chance have the list of the locks held by the system at this point?
No, but Renato, maybe you could also collect the output of "echo d" and "echo m" to sysrq-trigger when things go bad?
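For reference, a minimal sketch of how those two dumps could be requested, assuming root access and that the machine is still responsive enough to run a shell. The helper names and the SYSRQ_TRIGGER override are hypothetical and only there to allow a dry run against an ordinary file. 'd' only produces a lock list on kernels built with lock debugging (lockdep).

```shell
#!/bin/sh
# The real interface is /proc/sysrq-trigger; the override is a hypothetical
# convenience so the helpers can be tried against a plain file first.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Write a single SysRq command character to the trigger file.
sysrq() {
    printf '%s\n' "$1" > "$SYSRQ_TRIGGER"
}

# Request both dumps mentioned above; the output goes to the kernel log.
collect_sysrq_dumps() {
    # 'd': show all locks currently held (needs lockdep in the kernel)
    sysrq d || return 1
    # 'm': dump current memory information
    sysrq m
}
```

Running collect_sysrq_dumps as root and then saving the kernel log (for example with dmesg) should capture both dumps.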
Regards,
Tvrtko