Ensure a non-interruptible wait is used when moving a bo to XE_PL_SYSTEM. This prevents dma_mappings from being removed prematurely while a GPU job is still in progress, even if the CPU receives a signal during the operation.
Fixes: 75521e8b56e8 ("drm/xe: Perform dma_map when moving system buffer objects to TT") Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Suggested-by: Matthew Auld matthew.auld@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com --- drivers/gpu/drm/xe/xe_bo.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c index 73689dd7d672..b2aa368a23f8 100644 --- a/drivers/gpu/drm/xe/xe_bo.c +++ b/drivers/gpu/drm/xe/xe_bo.c @@ -733,7 +733,7 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict, new_mem->mem_type == XE_PL_SYSTEM) { long timeout = dma_resv_wait_timeout(ttm_bo->base.resv, DMA_RESV_USAGE_BOOKKEEP, - true, + false, MAX_SCHEDULE_TIMEOUT); if (timeout < 0) { ret = timeout;
On 05/12/2024 12:02, Nirmoy Das wrote:
Ensure a non-interruptible wait is used when moving a bo to XE_PL_SYSTEM. This prevents dma_mappings from being removed prematurely while a GPU job is still in progress, even if the CPU receives a signal during the operation.
Fixes: 75521e8b56e8 ("drm/xe: Perform dma_map when moving system buffer objects to TT") Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Suggested-by: Matthew Auld matthew.auld@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Reviewed-by: Matthew Auld matthew.auld@intel.com
There could be still migration job going on while doing xe_tt_unmap_sg() which could trigger GPU page faults. Fix this by waiting for the migration job to finish.
v2: Use intr=false(Matt A)
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3466 Fixes: 75521e8b56e8 ("drm/xe: Perform dma_map when moving system buffer objects to TT") Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Cc: Matthew Auld matthew.auld@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com --- drivers/gpu/drm/xe/xe_bo.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c index b2aa368a23f8..c906a5529db0 100644 --- a/drivers/gpu/drm/xe/xe_bo.c +++ b/drivers/gpu/drm/xe/xe_bo.c @@ -857,8 +857,16 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
out: if ((!ttm_bo->resource || ttm_bo->resource->mem_type == XE_PL_SYSTEM) && - ttm_bo->ttm) + ttm_bo->ttm) { + long timeout = dma_resv_wait_timeout(ttm_bo->base.resv, + DMA_RESV_USAGE_BOOKKEEP, + false, + MAX_SCHEDULE_TIMEOUT); + if (timeout < 0) + ret = timeout; + xe_tt_unmap_sg(ttm_bo->ttm); + }
return ret; }
On 05/12/2024 12:02, Nirmoy Das wrote:
There could be still migration job going on while doing xe_tt_unmap_sg() which could trigger GPU page faults. Fix this by waiting for the migration job to finish.
v2: Use intr=false(Matt A)
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3466 Fixes: 75521e8b56e8 ("drm/xe: Perform dma_map when moving system buffer objects to TT") Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Cc: Matthew Auld matthew.auld@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Ok, so this is something like ttm_bo_move_to_ghost() doing a pipeline move for tt -> system, but we then do xe_tt_unmap_sg() too early which tears down the IOMMU (if enabled) mappings whilst the job is in progress?
Maybe add some more info to the commit message? I think this for sure fixes it. Just wondering if it's somehow possible to keep the mapping until the job is done, since all tt -> sys moves are now synced here?
Unless Thomas has a better idea here, Reviewed-by: Matthew Auld matthew.auld@intel.com
drivers/gpu/drm/xe/xe_bo.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c index b2aa368a23f8..c906a5529db0 100644 --- a/drivers/gpu/drm/xe/xe_bo.c +++ b/drivers/gpu/drm/xe/xe_bo.c @@ -857,8 +857,16 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict, out: if ((!ttm_bo->resource || ttm_bo->resource->mem_type == XE_PL_SYSTEM) &&
ttm_bo->ttm)
ttm_bo->ttm) {
long timeout = dma_resv_wait_timeout(ttm_bo->base.resv,
DMA_RESV_USAGE_BOOKKEEP,
false,
MAX_SCHEDULE_TIMEOUT);
if (timeout < 0)
ret = timeout;
- xe_tt_unmap_sg(ttm_bo->ttm);
- }
return ret; }
On 12/5/2024 1:40 PM, Matthew Auld wrote:
On 05/12/2024 12:02, Nirmoy Das wrote:
There could be still migration job going on while doing xe_tt_unmap_sg() which could trigger GPU page faults. Fix this by waiting for the migration job to finish.
v2: Use intr=false(Matt A)
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3466 Fixes: 75521e8b56e8 ("drm/xe: Perform dma_map when moving system buffer objects to TT") Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Cc: Matthew Auld matthew.auld@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Ok, so this is something like ttm_bo_move_to_ghost() doing a pipeline move for tt -> system, but we then do xe_tt_unmap_sg() too early which tears down the IOMMU (if enabled) mappings whilst the job is in progress?
Yes, this exactly what is happening for this issue.
Maybe add some more info to the commit message?
I will add more details.
I think this for sure fixes it. Just wondering if it's somehow possible to keep the mapping until the job is done, since all tt -> sys moves are now synced here?
Unless Thomas has a better idea here, Reviewed-by: Matthew Auld matthew.auld@intel.com
Thanks,
Nirmoy
drivers/gpu/drm/xe/xe_bo.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c index b2aa368a23f8..c906a5529db0 100644 --- a/drivers/gpu/drm/xe/xe_bo.c +++ b/drivers/gpu/drm/xe/xe_bo.c @@ -857,8 +857,16 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict, out: if ((!ttm_bo->resource || ttm_bo->resource->mem_type == XE_PL_SYSTEM) && - ttm_bo->ttm) + ttm_bo->ttm) { + long timeout = dma_resv_wait_timeout(ttm_bo->base.resv, + DMA_RESV_USAGE_BOOKKEEP, + false, + MAX_SCHEDULE_TIMEOUT); + if (timeout < 0) + ret = timeout;
xe_tt_unmap_sg(ttm_bo->ttm); + } return ret; }
On Thu, Dec 05, 2024 at 03:33:46PM +0100, Nirmoy Das wrote:
On 12/5/2024 1:40 PM, Matthew Auld wrote:
On 05/12/2024 12:02, Nirmoy Das wrote:
There could be still migration job going on while doing xe_tt_unmap_sg() which could trigger GPU page faults. Fix this by waiting for the migration job to finish.
v2: Use intr=false(Matt A)
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3466 Fixes: 75521e8b56e8 ("drm/xe: Perform dma_map when moving system buffer objects to TT") Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Cc: Matthew Auld matthew.auld@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Ok, so this is something like ttm_bo_move_to_ghost() doing a pipeline move for tt -> system, but we then do xe_tt_unmap_sg() too early which tears down the IOMMU (if enabled) mappings whilst the job is in progress?
Yes, this exactly what is happening for this issue.
Maybe add some more info to the commit message?
I will add more details.
Are you going to send a new version? Once this is fixed, please also send a revert MR to the kconfig workaround 3940181b1bad @ gitlab.freedesktop.org/drm/xe/ci.git
Lucas De Marchi
I think this for sure fixes it. Just wondering if it's somehow possible to keep the mapping until the job is done, since all tt -> sys moves are now synced here?
Unless Thomas has a better idea here, Reviewed-by: Matthew Auld matthew.auld@intel.com
Thanks,
Nirmoy
drivers/gpu/drm/xe/xe_bo.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c index b2aa368a23f8..c906a5529db0 100644 --- a/drivers/gpu/drm/xe/xe_bo.c +++ b/drivers/gpu/drm/xe/xe_bo.c @@ -857,8 +857,16 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict, out: if ((!ttm_bo->resource || ttm_bo->resource->mem_type == XE_PL_SYSTEM) && - ttm_bo->ttm) + ttm_bo->ttm) { + long timeout = dma_resv_wait_timeout(ttm_bo->base.resv, + DMA_RESV_USAGE_BOOKKEEP, + false, + MAX_SCHEDULE_TIMEOUT); + if (timeout < 0) + ret = timeout;
xe_tt_unmap_sg(ttm_bo->ttm); + } return ret; }
On 12/10/2024 4:38 PM, Lucas De Marchi wrote:
On Thu, Dec 05, 2024 at 03:33:46PM +0100, Nirmoy Das wrote:
On 12/5/2024 1:40 PM, Matthew Auld wrote:
On 05/12/2024 12:02, Nirmoy Das wrote:
There could be still migration job going on while doing xe_tt_unmap_sg() which could trigger GPU page faults. Fix this by waiting for the migration job to finish.
v2: Use intr=false(Matt A)
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3466 Fixes: 75521e8b56e8 ("drm/xe: Perform dma_map when moving system buffer objects to TT") Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Cc: Matthew Auld matthew.auld@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Ok, so this is something like ttm_bo_move_to_ghost() doing a pipeline move for tt -> system, but we then do xe_tt_unmap_sg() too early which tears down the IOMMU (if enabled) mappings whilst the job is in progress?
Yes, this exactly what is happening for this issue.
Maybe add some more info to the commit message?
I will add more details.
Are you going to send a new version?
Was waiting for more reviews. Sent out v3 with updated commit message.
Once this is fixed, please also send a revert MR to the kconfig workaround 3940181b1bad @ gitlab.freedesktop.org/drm/xe/ci.git
I will do that.
Thanks,
Nirmoy
Lucas De Marchi
I think this for sure fixes it. Just wondering if it's somehow possible to keep the mapping until the job is done, since all tt -> sys moves are now synced here?
Unless Thomas has a better idea here, Reviewed-by: Matthew Auld matthew.auld@intel.com
Thanks,
Nirmoy
drivers/gpu/drm/xe/xe_bo.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c index b2aa368a23f8..c906a5529db0 100644 --- a/drivers/gpu/drm/xe/xe_bo.c +++ b/drivers/gpu/drm/xe/xe_bo.c @@ -857,8 +857,16 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict, out: if ((!ttm_bo->resource || ttm_bo->resource->mem_type == XE_PL_SYSTEM) && - ttm_bo->ttm) + ttm_bo->ttm) { + long timeout = dma_resv_wait_timeout(ttm_bo->base.resv, + DMA_RESV_USAGE_BOOKKEEP, + false, + MAX_SCHEDULE_TIMEOUT); + if (timeout < 0) + ret = timeout;
xe_tt_unmap_sg(ttm_bo->ttm); + } return ret; }
On Thu, 2024-12-05 at 12:40 +0000, Matthew Auld wrote:
On 05/12/2024 12:02, Nirmoy Das wrote:
There could be still migration job going on while doing xe_tt_unmap_sg() which could trigger GPU page faults. Fix this by waiting for the migration job to finish.
v2: Use intr=false(Matt A)
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3466 Fixes: 75521e8b56e8 ("drm/xe: Perform dma_map when moving system buffer objects to TT") Cc: Thomas Hellström thomas.hellstrom@linux.intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Cc: Matthew Auld matthew.auld@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Ok, so this is something like ttm_bo_move_to_ghost() doing a pipeline move for tt -> system, but we then do xe_tt_unmap_sg() too early which tears down the IOMMU (if enabled) mappings whilst the job is in progress?
Maybe add some more info to the commit message? I think this for sure fixes it. Just wondering if it's somehow possible to keep the mapping until the job is done, since all tt -> sys moves are now synced here?
Unless Thomas has a better idea here,
Not at the moment. Ideally we should somehow attach the dma-map to the ghost object. Perhaps moving forward we could look at attaching it to the XE_PL_TT resource. Then I believe it will get freed with the ghost object.
/Thomas
Reviewed-by: Matthew Auld matthew.auld@intel.com
drivers/gpu/drm/xe/xe_bo.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c index b2aa368a23f8..c906a5529db0 100644 --- a/drivers/gpu/drm/xe/xe_bo.c +++ b/drivers/gpu/drm/xe/xe_bo.c @@ -857,8 +857,16 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict, out: if ((!ttm_bo->resource || ttm_bo->resource->mem_type == XE_PL_SYSTEM) &&
- ttm_bo->ttm)
- ttm_bo->ttm) {
long timeout = dma_resv_wait_timeout(ttm_bo-
base.resv,
DMA_RESV_USAGE_BOOKKEEP,
false,
MAX_SCHEDULE_TIMEOUT);
if (timeout < 0)
ret = timeout;
xe_tt_unmap_sg(ttm_bo->ttm);
- }
return ret; }
linux-stable-mirror@lists.linaro.org