Exercising the gpu reloc path strenuously revealed an issue where the updated relocations (from MI_STORE_DWORD_IMM) were not being observed upon execution. After some experiments with adding pipecontrols (a lot of pipecontrols (32), as gen4/5 do not have a bit to wait on earlier pipe controls, or even on the current one), it was discovered that we merely needed to delay the EMIT_INVALIDATE by several flushes. It is important to note that it is the EMIT_INVALIDATE, not the EMIT_FLUSH, that needs the delay -- that is, contrary to what one might first expect, the delay is required for the TLB invalidation to take effect (one presumes, to purge any CS buffers), rather than being a delay after flushing to ensure the writes have landed before the invalidation is triggered.
Testcase: igt/gem_tiled_fence_blits
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/intel_ringbuffer.c | 38 +++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index b8a7a014d46d..87eebc13c0d8 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -91,6 +91,7 @@ static int
 gen4_render_ring_flush(struct i915_request *rq, u32 mode)
 {
 	u32 cmd, *cs;
+	int i;
 
 	/*
 	 * read/write caches:
@@ -127,12 +128,45 @@ gen4_render_ring_flush(struct i915_request *rq, u32 mode)
 			cmd |= MI_INVALIDATE_ISP;
 	}
 
-	cs = intel_ring_begin(rq, 2);
+	i = 2;
+	if (mode & EMIT_INVALIDATE)
+		i += 20;
+
+	cs = intel_ring_begin(rq, i);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
 	*cs++ = cmd;
-	*cs++ = MI_NOOP;
+
+	/*
+	 * A random delay to let the CS invalidate take effect? Without this
+	 * delay, the GPU relocation path fails as the CS does not see
+	 * the updated contents. Just as important, if we apply the flushes
+	 * to the EMIT_FLUSH branch (i.e. immediately after the relocation
+	 * write and before the invalidate on the next batch), the relocations
+	 * still fail. This implies that it is a delay following invalidation
+	 * that is required to reset the caches as opposed to a delay to
+	 * ensure the memory is written.
+	 */
+	if (mode & EMIT_INVALIDATE) {
+		*cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE;
+		*cs++ = i915_ggtt_offset(rq->engine->scratch) |
+			PIPE_CONTROL_GLOBAL_GTT;
+		*cs++ = 0;
+		*cs++ = 0;
+
+		for (i = 0; i < 12; i++)
+			*cs++ = MI_FLUSH;
+
+		*cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE;
+		*cs++ = i915_ggtt_offset(rq->engine->scratch) |
+			PIPE_CONTROL_GLOBAL_GTT;
+		*cs++ = 0;
+		*cs++ = 0;
+	}
+
+	*cs++ = cmd;
+
 	intel_ring_advance(rq, cs);
 
 	return 0;
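For context, the "gpu reloc path" above patches batch contents from the GPU itself. A minimal sketch of the store that the CS then fails to observe without the delay, using the MI_STORE_DWORD_IMM form i915 uses on gen4+ (the standalone helper here is illustrative, not the driver's actual function, which lives in the execbuffer relocation path):

/*
 * Illustrative sketch only: a gen4+ GPU-side relocation writes the
 * final target address into the batch with MI_STORE_DWORD_IMM before
 * that batch executes.
 */
static u32 *emit_gpu_reloc_gen4(u32 *cs, u32 addr, u32 target_offset)
{
	*cs++ = MI_STORE_DWORD_IMM_GEN4 | MI_USE_GGTT;
	*cs++ = 0;		/* reserved dword on gen4 */
	*cs++ = addr;		/* GGTT address of the reloc slot in the batch */
	*cs++ = target_offset;	/* relocated value the batch must read back */
	return cs;
}

It is this write that the next batch failed to observe, hence the flush-heavy EMIT_INVALIDATE in the patch.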
On Mon, Nov 05, 2018 at 09:43:05AM +0000, Chris Wilson wrote:
This smells a lot like the snb a/b w/a, except there the spec says to use 8 STORE_DWORDS. I suppose the choice of a specific command isn't critical, and it's just a matter of stuffing the pipeline with something that takes long enough to let the TLB invalidate finish?
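(For comparison, that snb-style variant would amount to something like the sketch below, reusing the same scratch page as the patch. The count of 8 is the number quoted above; the helper name and framing are hypothetical, not the documented w/a:)

/*
 * Hypothetical sketch of the snb-style variant mentioned above: stuff
 * the pipeline with 8 MI_STORE_DWORD_IMM writes to the scratch page
 * instead of MI_FLUSHes.
 */
static u32 *emit_store_dword_delay(struct i915_request *rq, u32 *cs)
{
	int i;

	for (i = 0; i < 8; i++) {
		*cs++ = MI_STORE_DWORD_IMM_GEN4 | MI_USE_GGTT;
		*cs++ = 0;
		*cs++ = i915_ggtt_offset(rq->engine->scratch);
		*cs++ = 0;	/* dummy value; only the delay matters */
	}

	return cs;
}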
Anyways, patch itself seems as reasonable as one might expect for an issue like this.
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Quoting Ville Syrjälä (2018-11-07 15:04:24)
> This smells a lot like the snb a/b w/a, except there the spec says to use 8 STORE_DWORDS.
Yeah, the similarity wasn't lost, except that w/a is to cover the coherency aspect of the writes not being flushed. This feels a bit fishier in that the experiments indicate it's an issue on the invalidation path as opposed to flushing the writes.
And it's reminiscent of the other w/a of using umpteen pipecontrols to get around the lack of PIPE_CONTROL_FLUSH.
> I suppose the choice of a specific command isn't critical, and it's just a matter of stuffing the pipeline with something that takes long enough to let the TLB invalidate finish?
Except that the MI_FLUSHes are more effective in smaller numbers than PIPE_CONTROLs, probably because each one translates to a few pipe controls or something similarly heavy.
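(For scale, the pipecontrol-only experiment described in the commit message -- 32 of them, since gen4/5 cannot wait on a pipe control completing -- would look roughly like this sketch, built from the same macros as the patch; the helper framing is illustrative:)

/*
 * Illustrative sketch of the experiment from the commit message: a
 * delay built purely from 32 PIPE_CONTROL QW writes to the scratch
 * page. The patch achieves the same effect with a dozen MI_FLUSHes
 * bracketed by two pipe controls.
 */
static u32 *emit_pipe_control_delay(struct i915_request *rq, u32 *cs)
{
	int i;

	for (i = 0; i < 32; i++) {
		*cs++ = GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE;
		*cs++ = i915_ggtt_offset(rq->engine->scratch) |
			PIPE_CONTROL_GLOBAL_GTT;
		*cs++ = 0;
		*cs++ = 0;
	}

	return cs;
}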
> Anyways, patch itself seems as reasonable as one might expect for an issue like this.
As nasty as one would expect.
For the record, we are not entirely out of danger. gem_exec_whisper continues to indicate a problem, but one step at a time. (I haven't yet found quite what's upsetting it, except that if we do each batch synchronously and verify each one, it's happy.)
-Chris