Thomas Zimmermann tzimmermann@suse.de writes:
Hi
Am 27.11.23 um 17:25 schrieb Alyssa Ross:
Thomas Zimmermann tzimmermann@suse.de writes:
Invoke drm_plane_helper_funcs.end_fb_access before drm_atomic_helper_commit_hw_done(). The latter function hands over ownership of the plane state to the following commit, which might free it. Releasing resources in end_fb_access then operates on undefined state. This bug has been observed with non-blocking commits when they are being queued up quickly.
Here is an example stack trace from the bug report. The plane state has been free'd already, so the pages for drm_gem_fb_vunmap() are gone.
Unable to handle kernel paging request at virtual address 0000000100000049 [...] drm_gem_fb_vunmap+0x18/0x74 drm_gem_end_shadow_fb_access+0x1c/0x2c drm_atomic_helper_cleanup_planes+0x58/0xd8 drm_atomic_helper_commit_tail+0x90/0xa0 commit_tail+0x15c/0x188 commit_work+0x14/0x20
For aborted commits, it is still ok to run end_fb_access as part of the plane's cleanup. Add a test to drm_atomic_helper_cleanup_planes().
v2:
- fix test in drm_atomic_helper_cleanup_planes()
Reported-by: Alyssa Ross hi@alyssa.is Closes: https://lore.kernel.org/dri-devel/87leazm0ya.fsf@alyssa.is/ Suggested-by: Daniel Vetter daniel@ffwll.ch Fixes: 94d879eaf7fb ("drm/atomic-helper: Add {begin,end}_fb_access to plane helpers") Signed-off-by: Thomas Zimmermann tzimmermann@suse.de Cc: stable@vger.kernel.org # v6.2+
drivers/gpu/drm/drm_atomic_helper.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+)
Got this basically immediately. :(
I've never seen such problems on other systems. Is there anything different about the Mac systems? How do you trigger these errors?
My understanding is that all sorts of things are different, but I don't know too much about the details. There's of course a chance that there could be some other change in the Asahi Linux kernel that causes this problem to surface — as I said, I reviewed the diff with mainline and didn't see anything that looked relevant, but I could well have missed something. I don't think I can test mainline directly, as it doesn't yet support enough of the hardware — for slightly older Apple Silicon Mac models, I think enough is upstream that this would be possible, but I don't have access to any.
I started off encountering these errors every few days. I noticed them because they would sometimes result in my system either starting to freeze for 10 seconds at a time, or until I switched VT. They seem to correlate with the system being under high CPU load. I was also able to substantially increase the frequency with which they occurred by adding logging to the kernel — even just drm.debug=0x10 makes a big difference, and when I also added a few dump_backtrace() calls when I was trying to understand the code and diagnose the problem, I would relatively consistently encounter an Oops within a few minutes of load.
BTW: v3 is looking good so far. I've only been testing it since this morning, though, so I'll keep trying it out for a bit longer before I declare the problem to have been solved and send a Tested-by.