Flush the xe ordered_wq when a ufence wait times out. The timeout is observed on LNL and points to the recent scheduling issue with E-cores.
This is similar to the recent fix, commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout"), and should be removed once there is an E-core scheduling fix.
v2: Add platform check(Himal)
    s/__flush_workqueue/flush_workqueue(Jani)

Cc: Badal Nilawar badal.nilawar@intel.com
Cc: Jani Nikula jani.nikula@intel.com
Cc: Matthew Auld matthew.auld@intel.com
Cc: John Harrison John.C.Harrison@Intel.com
Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com
Cc: Lucas De Marchi lucas.demarchi@intel.com
Cc: stable@vger.kernel.org # v6.11+
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754
Suggested-by: Matthew Brost matthew.brost@intel.com
Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Reviewed-by: Matthew Brost matthew.brost@intel.com
---
 drivers/gpu/drm/xe/xe_wait_user_fence.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
 #include "xe_exec_queue.h"

 static int do_compare(u64 addr, u64 value, u64 mask, u16 op)
@@ -155,6 +156,19 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
 		}

 		if (!timeout) {
+			if (IS_LUNARLAKE(xe)) {
+				/*
+				 * This is analogous to e51527233804 ("drm/xe/guc/ct: Flush g2h
+				 * worker in case of g2h response timeout")
+				 *
+				 * TODO: Drop this change once workqueue scheduling delay issue is
+				 * fixed on LNL Hybrid CPU.
+				 */
+				flush_workqueue(xe->ordered_wq);
+				err = do_compare(addr, args->value, args->mask, args->op);
+				if (err <= 0)
+					break;
+			}
 			err = -ETIME;
 			break;
 		}
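The control flow added by the hunk above is: once the wait has timed out, flush work that may still be queued, then evaluate the fence condition one more time before reporting -ETIME. A minimal userspace sketch of that pattern follows. Everything in it (fake_wq, fake_flush, signal_fence, compare_eq, timeout_path) is a hypothetical stand-in written for illustration; none of it is the kernel workqueue API or the xe driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a workqueue that holds at most one work item. */
struct fake_wq {
	void (*pending)(void *ctx);	/* NULL when nothing is queued */
	void *ctx;
};

/* Run the queued item, if any. Only loosely models flush_workqueue(). */
static void fake_flush(struct fake_wq *wq)
{
	assert(wq != NULL);
	if (wq->pending) {
		void (*fn)(void *ctx) = wq->pending;

		wq->pending = NULL;
		fn(wq->ctx);
	}
}

/* Delayed work item: signals the fence by storing 1 into it. */
static void signal_fence(void *ctx)
{
	*(uint64_t *)ctx = 1;
}

/* Returns 0 when the masked fence equals the masked value, 1 otherwise. */
static int compare_eq(const uint64_t *fence, uint64_t value, uint64_t mask)
{
	return (*fence & mask) == (value & mask) ? 0 : 1;
}

/*
 * Entered only after the wait has already timed out: flush, re-check,
 * and report -ETIME only if the fence is still unsignalled.
 */
static int timeout_path(struct fake_wq *wq, const uint64_t *fence,
			uint64_t value, uint64_t mask)
{
	int err;

	fake_flush(wq);
	err = compare_eq(fence, value, mask);
	if (err <= 0)
		return err;
	return -ETIME;
}
```

If the signalling work is still queued when the timeout fires, timeout_path() returns 0 after the flush; with nothing queued it returns -ETIME, which is the behaviour without the workaround.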
On 10/24/2024 5:18 PM, Nirmoy Das wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
Sorry, sent too soon. This is a bit out of place. I will sort it out and resend after some time, to accumulate reviews.
On Thu, 24 Oct 2024, Nirmoy Das nirmoy.das@intel.com wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
Sorry, you just can't use this in xe core. At all. Not even a little bit. It's purely for i915 display compat code.
If you need it for the LNL platform check, you need to use:
xe->info.platform == XE_LUNARLAKE
Although platform checks in xe code are generally discouraged.
BR, Jani.
On 10/24/2024 6:32 PM, Jani Nikula wrote:
On Thu, 24 Oct 2024, Nirmoy Das nirmoy.das@intel.com wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
Sorry, you just can't use this in xe core. At all. Not even a little bit. It's purely for i915 display compat code.
If you need it for the LNL platform check, you need to use:
xe->info.platform == XE_LUNARLAKE
Will do that. That macro looked odd but I didn't know a better way.
Although platform checks in xe code are generally discouraged.
Unfortunately, this issue depends on the platform rather than the graphics IP.
Thanks,
Nirmoy
BR, Jani.
On 10/25/2024 09:03, Nirmoy Das wrote:
On 10/24/2024 6:32 PM, Jani Nikula wrote:
On Thu, 24 Oct 2024, Nirmoy Das nirmoy.das@intel.com wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
Sorry, you just can't use this in xe core. At all. Not even a little bit. It's purely for i915 display compat code.
If you need it for the LNL platform check, you need to use:
xe->info.platform == XE_LUNARLAKE
Will do that. That macro looked odd but I didn't know a better way.
Although platform checks in xe code are generally discouraged.
Unfortunately, this issue depends on the platform rather than the graphics IP.
But isn't this issue dependent upon the CPU platform not the graphics platform? As in, a DG2 card plugged in to a LNL host will also have this issue. So testing any graphics related value is technically incorrect.
John.
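The distinction raised here, graphics platform versus host CPU, can be made concrete with a small userspace sketch. The enums, the struct, and both helper functions below are hypothetical and exist only for this illustration; they are not the xe driver's platform enum or the kernel's CPU identification interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical identifiers for this sketch; not the kernel's enums. */
enum sketch_gpu { SKETCH_GPU_DG2, SKETCH_GPU_LUNARLAKE };
enum sketch_cpu { SKETCH_CPU_OTHER, SKETCH_CPU_LUNARLAKE };

struct sketch_system {
	enum sketch_gpu gpu;	/* graphics platform the driver is bound to */
	enum sketch_cpu cpu;	/* host CPU that runs the workqueue threads */
};

/* Keyed on the graphics platform, like the check in the posted patch. */
static bool workaround_by_gpu(const struct sketch_system *sys)
{
	assert(sys != NULL);
	return sys->gpu == SKETCH_GPU_LUNARLAKE;
}

/* Keyed on the host CPU, where the scheduling delay is suspected to be. */
static bool workaround_by_cpu(const struct sketch_system *sys)
{
	assert(sys != NULL);
	return sys->cpu == SKETCH_CPU_LUNARLAKE;
}
```

For a DG2 card in a Lunar Lake host, workaround_by_gpu() returns false while workaround_by_cpu() returns true, which is the case a graphics-keyed check would miss.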
Thanks,
Nirmoy
BR, Jani.
On Fri, Oct 25, 2024 at 11:27:55AM -0700, John Harrison wrote:
On 10/25/2024 09:03, Nirmoy Das wrote:
On 10/24/2024 6:32 PM, Jani Nikula wrote:
On Thu, 24 Oct 2024, Nirmoy Das nirmoy.das@intel.com wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
Sorry, you just can't use this in xe core. At all. Not even a little bit. It's purely for i915 display compat code.
If you need it for the LNL platform check, you need to use:
xe->info.platform == XE_LUNARLAKE
Will do that. That macro looked odd but I didn't know a better way.
Although platform checks in xe code are generally discouraged.
Unfortunately, this issue depends on the platform rather than the graphics IP.
But isn't this issue dependent upon the CPU platform not the graphics platform? As in, a DG2 card plugged in to a LNL host will also have this issue. So testing any graphics related value is technically incorrect.
This is a good point. Maybe for now we blindly do this regardless of platform; it is basically harmless to do this after a timeout. Also, add a warning message if we can detect that this fixed the timeout, for CI purposes.
Matt
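The alternative described in this reply (flush unconditionally after a timeout, and warn when the flush is what rescued the wait) could look roughly like the userspace sketch below. The struct, the helper names, and the warning text are assumptions made for illustration; the real driver would use its own logging and its own fence comparison.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical state: one piece of delayed work that signals the fence. */
struct sketch_ctx {
	bool work_pending;
	bool signalled;
};

/* Stand-in for flushing the workqueue: runs the delayed work, if any. */
static void sketch_flush(struct sketch_ctx *ctx)
{
	if (ctx->work_pending) {
		ctx->work_pending = false;
		ctx->signalled = true;
	}
}

/*
 * Timeout path without any platform check. Assumes the fence was not
 * signalled when the timeout fired, so a signalled fence after the flush
 * means the flush is what made the wait succeed; that case is counted
 * and reported so CI can spot it.
 */
static int timeout_path_unconditional(struct sketch_ctx *ctx,
				      unsigned int *warnings)
{
	assert(ctx != NULL && warnings != NULL);

	sketch_flush(ctx);
	if (ctx->signalled) {
		fprintf(stderr, "sketch: wait succeeded only after flush\n");
		(*warnings)++;
		return 0;
	}
	return -ETIME;
}
```

The warning counter is only there so the effect can be observed; the point is that the flush costs little on the timeout path and the message makes the workaround's hits visible.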
John.
Thanks,
Nirmoy
BR, Jani.
On 10/25/2024 8:34 PM, Matthew Brost wrote:
On Fri, Oct 25, 2024 at 11:27:55AM -0700, John Harrison wrote:
On 10/25/2024 09:03, Nirmoy Das wrote:
On 10/24/2024 6:32 PM, Jani Nikula wrote:
On Thu, 24 Oct 2024, Nirmoy Das nirmoy.das@intel.com wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
Sorry, you just can't use this in xe core. At all. Not even a little bit. It's purely for i915 display compat code.
If you need it for the LNL platform check, you need to use:
xe->info.platform == XE_LUNARLAKE
Will do that. That macro looked odd but I didn't know a better way.
Although platform checks in xe code are generally discouraged.
Unfortunately, this issue depends on the platform rather than the graphics IP.
But isn't this issue dependent upon the CPU platform not the graphics platform? As in, a DG2 card plugged in to a LNL host will also have this issue. So testing any graphics related value is technically incorrect.
I hadn't thought about that. LNL only has x8 PCIe lanes shared between NVMe and other I/O, but a Thunderbolt-based eGPU should be easily doable.
I think I could do "if (boot_cpu_data.x86_vfm == INTEL_LUNARLAKE_M)" instead.
This is a good point. Maybe for now we blindly do this regardless of platform; it is basically harmless to do this after a timeout. Also, add a warning message if we can detect that this fixed the timeout, for CI purposes.
I am open to this as well. Please let me know which one would be the better solution here.
Regards,
Nirmoy
Matt
John.
Thanks,
Nirmoy
BR, Jani.
On Fri, Oct 25, 2024 at 09:33:39PM +0200, Nirmoy Das wrote:
On 10/25/2024 8:34 PM, Matthew Brost wrote:
On Fri, Oct 25, 2024 at 11:27:55AM -0700, John Harrison wrote:
On 10/25/2024 09:03, Nirmoy Das wrote:
On 10/24/2024 6:32 PM, Jani Nikula wrote:
On Thu, 24 Oct 2024, Nirmoy Das nirmoy.das@intel.com wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
Sorry, you just can't use this in xe core. At all. Not even a little bit. It's purely for i915 display compat code.
If you need it for the LNL platform check, you need to use:
xe->info.platform == XE_LUNARLAKE
Will do that. That macro looked odd but I didn't know a better way.
Although platform checks in xe code are generally discouraged.
Unfortunately, this issue depends on the platform rather than the graphics IP.
But isn't this issue dependent upon the CPU platform not the graphics platform? As in, a DG2 card plugged in to a LNL host will also have this issue. So testing any graphics related value is technically incorrect.
I hadn't thought about that. LNL only has x8 PCIe lanes shared between NVMe and other I/O, but a Thunderbolt-based eGPU should be easily doable.
I think I could do "if (boot_cpu_data.x86_vfm == INTEL_LUNARLAKE_M)" instead.
This is a good point. Maybe for now we blindly do this regardless of platform; it is basically harmless to do this after a timeout. Also, add a warning message if we can detect that this fixed the timeout, for CI purposes.
I am open to this as well. Please let me know which one would be the better solution here.
If it's a cheap thing without side effects, go for the version without the platform check and document it in the commit message / source comment.
Lucas De Marchi
On 10/25/2024 9:56 PM, Lucas De Marchi wrote:
On Fri, Oct 25, 2024 at 09:33:39PM +0200, Nirmoy Das wrote:
On 10/25/2024 8:34 PM, Matthew Brost wrote:
On Fri, Oct 25, 2024 at 11:27:55AM -0700, John Harrison wrote:
On 10/25/2024 09:03, Nirmoy Das wrote:
On 10/24/2024 6:32 PM, Jani Nikula wrote:
On Thu, 24 Oct 2024, Nirmoy Das nirmoy.das@intel.com wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
Sorry, you just can't use this in xe core. At all. Not even a little bit. It's purely for i915 display compat code.
If you need it for the LNL platform check, you need to use:
xe->info.platform == XE_LUNARLAKE
Will do that. That macro looked odd but I didn't know a better way.
Although platform checks in xe code are generally discouraged.
Unfortunately, this issue depends on the platform rather than the graphics IP.
But isn't this issue dependent upon the CPU platform not the graphics platform? As in, a DG2 card plugged in to a LNL host will also have this issue. So testing any graphics related value is technically incorrect.
I hadn't thought about that. LNL only has x8 PCIe lanes shared between NVMe and other I/O, but a Thunderbolt-based eGPU should be easily doable.
I think I could do "if (boot_cpu_data.x86_vfm == INTEL_LUNARLAKE_M)" instead.
This is a good point. Maybe for now we blindly do this regardless of platform; it is basically harmless to do this after a timeout. Also, add a warning message if we can detect that this fixed the timeout, for CI purposes.
I am open to this as well. Please let me know which one would be the better solution here.
If it's a cheap thing without side effects, go for the version without the platform check and document it in the commit message / source comment.
That would be the previous rev. I will add the missing stable Cc and resend.
Thanks,
Nirmoy
Lucas De Marchi
On 10/24/2024 08:18, Nirmoy Das wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
 #include "xe_exec_queue.h"

 static int do_compare(u64 addr, u64 value, u64 mask, u16 op)
@@ -155,6 +156,19 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
 		}

 		if (!timeout) {
+			if (IS_LUNARLAKE(xe)) {
+				/*
+				 * This is analogous to e51527233804 ("drm/xe/guc/ct: Flush g2h
+				 * worker in case of g2h response timeout")
+				 *
+				 * TODO: Drop this change once workqueue scheduling delay issue is
+				 * fixed on LNL Hybrid CPU.
+				 */
+				flush_workqueue(xe->ordered_wq);
If we are going to have multiple instances of this workaround, can we wrap them up as 'LNL_FLUSH_WORKQUEUE(q)' or some such? Put the IS_LNL check inside the macro and make it pretty obvious exactly where all the instances are by having a single macro name to search for.
John.
+				err = do_compare(addr, args->value, args->mask, args->op);
+				if (err <= 0)
+					break;
+			}
 			err = -ETIME;
 			break;
 		}
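The wrapper suggested in this reply could take roughly the following shape. This is a userspace sketch under stated assumptions: the macro name comes from the review comment, but its two-argument form, the sketch_dev and sketch_wq types, and sketch_flush_workqueue() are hypothetical stand-ins for the driver's device structure and for flush_workqueue().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins so the sketch builds outside the kernel. */
struct sketch_wq {
	unsigned int flush_count;
};

struct sketch_dev {
	bool is_lnl;			/* models the platform test */
	struct sketch_wq ordered_wq;
};

static void sketch_flush_workqueue(struct sketch_wq *wq)
{
	assert(wq != NULL);
	wq->flush_count++;
}

/*
 * Single, greppable name for every instance of the workaround. The
 * platform test lives inside the macro, as suggested in the review.
 */
#define LNL_FLUSH_WORKQUEUE(dev__, wq__)		\
	do {						\
		if ((dev__)->is_lnl)			\
			sketch_flush_workqueue(wq__);	\
	} while (0)
```

Keeping the platform test inside the macro means call sites stay one line long, and removing the workaround later comes down to searching for one name.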
On Thu, Oct 24, 2024 at 10:14:21AM -0700, John Harrison wrote:
On 10/24/2024 08:18, Nirmoy Das wrote:
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..78a0ad3c78fe 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -13,6 +13,7 @@
 #include "xe_device.h"
 #include "xe_gt.h"
 #include "xe_macros.h"
+#include "compat-i915-headers/i915_drv.h"
 #include "xe_exec_queue.h"

 static int do_compare(u64 addr, u64 value, u64 mask, u16 op)
@@ -155,6 +156,19 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
 		}

 		if (!timeout) {
+			if (IS_LUNARLAKE(xe)) {
+				/*
+				 * This is analogous to e51527233804 ("drm/xe/guc/ct: Flush g2h
+				 * worker in case of g2h response timeout")
+				 *
+				 * TODO: Drop this change once workqueue scheduling delay issue is
+				 * fixed on LNL Hybrid CPU.
+				 */
+				flush_workqueue(xe->ordered_wq);
If we are going to have multiple instances of this workaround, can we wrap them up as 'LNL_FLUSH_WORKQUEUE(q)' or some such? Put the IS_LNL check inside the macro and make it pretty obvious exactly where all the instances are by having a single macro name to search for.
+1, I think Lucas is suggesting something similar to this on the chat to make sure we don't lose track of removing these W/A when this gets fixed.
Matt
John.
+				err = do_compare(addr, args->value, args->mask, args->op);
+				if (err <= 0)
+					break;
+			}
 			err = -ETIME;
 			break;
 		}
linux-stable-mirror@lists.linaro.org