I've cc'ed some folks in hopes to get this resolved upstream.
Either way, 4.1's EoL was previously moved to about 6 months from now, so hopefully we'll have more than enough time to get this resolved.
On Sat, Nov 11, 2017 at 10:13:55PM +0000, Tuncer Ayaz wrote:
The predicament I'm in on my machines is that ever since drm-intel implemented atomic modesetting, there's been a list of regressions caused by those fundamental architecture changes and the code churn they implied. This means 4.1 is (from what I can tell) the last kernel before atomic modesetting was added and the only kernel free of all those issues, which necessitate trying out various combinations of flags on the kernel cmdline.
For instance, right now I'm trying 4.13.12 with these flags: video=SVIDEO-1:d i915.semaphores=1 i915.enable_rc6=0 i915.enable_psr=0 intel_iommu=igfx_off
PS: I'm kinda confused how anyone uses DMAR with VT-d when it's known to be buggy.
The flags seem to decrease the chances of provoking the bugs, but after a day of running Xorg, it's possible to still hit the RCS0 GPU hangs.
If you don't pass video=SVIDEO-1:d, then atomic's flip_done times out on boot or exit to VT console. It's good that other people have the same issues and have been following the bugzilla tickets, and can confirm the results.
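To make the flag list above easier to compare between boots, here is a minimal Python sketch that splits a kernel command line into parameters and picks out the ones addressed to a given module. The helper names (parse_cmdline, module_params) are made up for illustration, and the parsing is deliberately naive: it ignores quoting and the other rules the real kernel parser applies.

```python
from typing import Dict, Optional

# The flags quoted above, as a single command-line string.
CMDLINE = (
    "video=SVIDEO-1:d i915.semaphores=1 i915.enable_rc6=0 "
    "i915.enable_psr=0 intel_iommu=igfx_off"
)


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a command line into {key: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def module_params(params: Dict[str, Optional[str]], module: str) -> Dict[str, Optional[str]]:
    """Return only the 'module.param' entries, with the module prefix stripped."""
    prefix = module + "."
    return {k[len(prefix):]: v for k, v in params.items() if k.startswith(prefix)}


params = parse_cmdline(CMDLINE)
i915_params = module_params(params, "i915")
for name, value in sorted(i915_params.items()):
    print(f"i915.{name} = {value}")
```

On a running system the same function could be fed the contents of /proc/cmdline instead of the hard-coded string.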
I'm kinda glad I don't have a machine that's newer than Sandy Bridge, since that means I can use 4.1. It's not a long-term solution, though: the plan is for the reported bugzilla tickets to be resolved at some point, or for me to switch away from Intel GPUs, which might be doable if I save money, get an AMD APU laptop next summer, and switch my desktop to a discrete GPU.
For example: https://bugs.freedesktop.org/show_bug.cgi?id=101237 https://bugs.freedesktop.org/show_bug.cgi?id=103076 https://bbs.archlinux.org/viewtopic.php?id=218581&p=3 https://bugs.archlinux.org/task/51703
So, since 4.4, 4.9, 4.12 and drm-tip are still regressive, I wanted to ask if you considered pushing back 4.1's EOL.
Given a look at bugzilla, I have the impression that those issues will need at least another year before they're fixed, since most of them have been sitting there for many, many months. I suspect the Intel DRM team doesn't have the bandwidth to address the issues in a timely fashion while still adding bring-up for new GPUs and features (fences, etc.).
The generic modesetting DDX and Wayland are less susceptible to the GPU hangs, but can be made to provoke them if tried long enough. However, the modesetting DDX tears heavily and is about to gain atomic modesetting in the next Xorg release, so it will suffer from the same easy GPU hang likelihood.
Prior to Sandy Bridge there was zero tearing, but beginning with Sandy Bridge, xf86-video-intel's TearFree=TRUE is the only reliable way to fix Xorg tearing.
I do appreciate you maintaining 4.1 so far and hate to admit that I'm reliant on it on more than two machines, before and after Sandy Bridge, excluding those machines which need a newer kernel. I also understand how much work this is, and since I'm not using Linux professionally for a product, I can't offer compensation for your time. I can only offer to collect and point you at a list of DRM bugs for validation of my claims.
Tuncer, where's your bug report? Can't find one. Please file your bug at the fdo bugzilla.
Thanks, Jani.
On Mon, 13 Nov 2017, alexander.levin@verizon.com wrote:
I've cc'ed some folks in hopes to get this resolved upstream.
Either way, 4.1's EoL was previously moved to about 6 months from now, so hopefully we'll have more than enough time to get this resolved.
On Sat, Nov 11, 2017 at 10:13:55PM +0000, Tuncer Ayaz wrote:
The predicament I'm in on my machines is that ever since drm-intel implemented atomic modesetting, there's been a list of regressions caused by those fundamental architecture changes and the code churn they implied. This means 4.1 is (from what I can tell) the last kernel before atomic modesetting was added and the only kernel free of all those issues, which necessitate trying out various combinations of flags on the kernel cmdline.
For instance, right now I'm trying 4.13.12 with these flags: video=SVIDEO-1:d i915.semaphores=1 i915.enable_rc6=0 i915.enable_psr=0 intel_iommu=igfx_off
PS: I'm kinda confused how anyone uses DMAR with VT-d when it's known to be buggy.
The flags seem to decrease the chances of provoking the bugs, but after a day of running Xorg, it's possible to still hit the RCS0 GPU hangs.
If you don't pass video=SVIDEO-1:d, then atomic's flip_done times out on boot or exit to VT console. It's good that other people have the same issues and have been following the bugzilla tickets, and can confirm the results.
I'm kinda glad I don't have a machine that's newer than Sandy Bridge, since that means I can use 4.1. It's not a long-term solution, though: the plan is for the reported bugzilla tickets to be resolved at some point, or for me to switch away from Intel GPUs, which might be doable if I save money, get an AMD APU laptop next summer, and switch my desktop to a discrete GPU.
For example: https://bugs.freedesktop.org/show_bug.cgi?id=101237 https://bugs.freedesktop.org/show_bug.cgi?id=103076 https://bbs.archlinux.org/viewtopic.php?id=218581&p=3 https://bugs.archlinux.org/task/51703
So, since 4.4, 4.9, 4.12 and drm-tip are still regressive, I wanted to ask if you considered pushing back 4.1's EOL.
Given a look at bugzilla, I have the impression that those issues will need at least another year before they're fixed, since most of them have been sitting there for many, many months. I suspect the Intel DRM team doesn't have the bandwidth to address the issues in a timely fashion while still adding bring-up for new GPUs and features (fences, etc.).
The generic modesetting DDX and Wayland are less susceptible to the GPU hangs, but can be made to provoke them if tried long enough. However, the modesetting DDX tears heavily and is about to gain atomic modesetting in the next Xorg release, so it will suffer from the same easy GPU hang likelihood.
Prior to Sandy Bridge there was zero tearing, but beginning with Sandy Bridge, xf86-video-intel's TearFree=TRUE is the only reliable way to fix Xorg tearing.
I do appreciate you maintaining 4.1 so far and hate to admit that I'm reliant on it on more than two machines, before and after Sandy Bridge, excluding those machines which need a newer kernel. I also understand how much work this is, and since I'm not using Linux professionally for a product, I can't offer compensation for your time. I can only offer to collect and point you at a list of DRM bugs for validation of my claims.
On 11/14/17, Jani Nikula jani.nikula@linux.intel.com wrote:
Tuncer, where's your bug report? Can't find one. Please file your bug at the fdo bugzilla.
I'm sorry if this wasn't clear.
I didn't file a bug report since others have already done so, reporting the same symptoms. I did sign up yesterday to confirm this in the most recent bug report. And I don't think it makes sense to re-file the exact same report.
The way I arrived there is via a forum post related to x220 regressions, but it doesn't look exclusive to Sandy Bridge GPUs.
On Tue, 14 Nov 2017, Tuncer Ayaz tuncer.ayaz@gmail.com wrote:
On 11/14/17, Jani Nikula jani.nikula@linux.intel.com wrote:
Tuncer, where's your bug report? Can't find one. Please file your bug at the fdo bugzilla.
I'm sorry if this wasn't clear.
I didn't file a bug report since others have already done so, reporting the same symptoms. I did sign up yesterday to confirm this in the most recent bug report. And I don't think it makes sense to re-file the exact same report.
The way I arrived there is via a forum post related to x220 regressions, but it doesn't look exclusive to Sandy Bridge GPUs.
The freedesktop.org bugs you reference are for rather different platforms than yours. There's nothing there to indicate v4.1 being the last known good kernel like for you. There is no exact same report.
Please file the bug. Please run v4.14 or drm-tip branch from [1]. Please remove all other module parameters, but add drm.debug=14, and attach the dmesg from boot to the problem. Please attach the GPU error state if you get a GPU hang. Please let us decide if we've seen the bug before or not.
We've been continuously improving our CI and test assets and expanding the hardware pool we run the tests on for years now. Even so, bugs obviously slip through. And it's really *really* hard to revert anything or fix regressions when we get the reports about two years or a dozen kernel releases after we've broken stuff. :(
BR, Jani.
[1] https://cgit.freedesktop.org/drm/drm-tip
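As a rough illustration of the steps requested above, the following Python sketch derives a debug command line and gathers the two attachments. It rests on several assumptions that are not from this thread: "module parameters" is read as any token of the form module.param=value, the GPU error state is taken to live at /sys/class/drm/card0/error (the card index may differ), and the helper names are hypothetical. Reading dmesg and the error state typically needs root.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumed location of the i915 error state; the card index may differ.
DEFAULT_ERROR_STATE = Path("/sys/class/drm/card0/error")


def debug_cmdline(cmdline: str) -> str:
    """Drop module.param=value tokens and append drm.debug=14.

    Assumption: a token is a module parameter if its key contains a dot.
    """
    kept = [t for t in cmdline.split() if "." not in t.partition("=")[0]]
    kept.append("drm.debug=14")
    return " ".join(kept)


def collect_report(out_dir: Path, error_state: Path = DEFAULT_ERROR_STATE) -> List[Path]:
    """Save the dmesg output and, if present, the GPU error state into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []

    # Full kernel log since boot, to be attached to the bug report.
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    dmesg_file = out_dir / "dmesg.txt"
    dmesg_file.write_text(dmesg.stdout)
    saved.append(dmesg_file)

    # The error state is only meaningful after a GPU hang has been recorded.
    if error_state.exists():
        error_file = out_dir / "gpu_error_state.txt"
        error_file.write_bytes(error_state.read_bytes())
        saved.append(error_file)
    return saved


old = "video=SVIDEO-1:d i915.semaphores=1 i915.enable_rc6=0 intel_iommu=igfx_off"
print(debug_cmdline(old))
# After reproducing the problem: collect_report(Path("i915-report"))
```

Whether non-module options such as video= or intel_iommu= should also be dropped is left open here; the sketch keeps them.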
On 11/15/17, Jani Nikula jani.nikula@linux.intel.com wrote:
The freedesktop.org bugs you reference are for rather different platforms than yours. There's nothing there to indicate v4.1 being the last known good kernel like for you. There is no exact same report.
I don't follow why you think it's a different platform and how I might have "more" definitely shown v4.1 to be good, but I'll trust your judgement as a drm dev and not argue :).
Please file the bug. Please run v4.14 or drm-tip branch from [1]. Please remove all other module parameters, but add drm.debug=14, and attach the dmesg from boot to the problem. Please attach the GPU error state if you get a GPU hang. Please let us decide if we've seen the bug before or not.
Is the flip_done timeout on exit from Xorg a separate bug? That's one of the symptoms.
The other symptom is GEM errors in dmesg followed by rcs0 gpu hangs some time later.
In both cases the machine will be temporarily unresponsive or even hang indefinitely.
I can't say when the bugs will be filed. Hopefully soon.
We've been continuously improving our CI and test assets and expanding the hardware pool we run the tests on for years now. Even so, bugs obviously slip through. And it's really *really* hard to revert anything or fix regressions when we get the reports about two years or a dozen kernel releases after we've broken stuff. :(
Sure, but it's important to note that the rcs0 hangs have been very visible in 4.13 and, if present, were better hidden in older kernels. Meaning, they didn't appear easily enough in older kernels for me to take notice and report.
On Wed, 15 Nov 2017, Tuncer Ayaz tuncer.ayaz@gmail.com wrote:
I don't follow why you think it's a different platform and how I might have "more" definitely shown v4.1 to be good, but I'll trust your judgement as a drm dev and not argue :).
You apparently have Sandy Bridge, the referenced reports are about Broadwell and Skylake. Even if the symptoms you see are the same, the root causes might be wildly different, needing a different fix.
I've learned the hard way not to make assumptions without detailed information, which in this case I don't have. As in, I don't even know for sure if you have Sandy Bridge or not, although it's alluded to in your message.
From my point of view, you're shouting regression while giving us nothing to work with. You need to help us to help you.
BR, Jani.
On 11/16/17, Jani Nikula jani.nikula@linux.intel.com wrote:
On Wed, 15 Nov 2017, Tuncer Ayaz tuncer.ayaz@gmail.com wrote:
I don't follow why you think it's a different platform and how I might have "more" definitely shown v4.1 to be good, but I'll trust your judgement as a drm dev and not argue :).
You apparently have Sandy Bridge, the referenced reports are about Broadwell and Skylake. Even if the symptoms you see are the same, the root causes might be wildly different, needing a different fix.
Thanks for taking time to explain and clear my confusion :).
I checked the comments of the other reporter with a Sandy Bridge system, and they haven't provided a proper trace. Hence, you're absolutely right.
I've learned the hard way not to make assumptions without detailed information, which in this case I don't have. As in, I don't even know for sure if you have Sandy Bridge or not, although it's alluded to in your message.
I do (Sandy Bridge), sorry for not being clearer about that.
From my point of view, you're shouting regression while giving us nothing to work with. You need to help us to help you.
Like I said, will do.
linux-stable-mirror@lists.linaro.org