On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries") where, presumably, due to a missing Global Observation Point synchronisation, the write pointer of the CSB ringbuffer is updated _prior_ to the contents of the ringbuffer. That is we see the GPU report more context-switch entries for us to parse, but those entries have not been written, leading us to process stale events, and eventually report a hung GPU.
However, this effect appears to be much more severe than we previously saw on Icelake (though it might be best if we try the same approach there as well and measure), and Bruce suggested the good idea of resetting the CSB entry after use so that we can detect when it has been updated by the GPU. By instrumenting how long that may be, we can set a reliable upper bound for how long we should wait for:
513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
Suggested-by: Bruce Chang <yu.bruce.chang@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: stable@vger.kernel.org # v5.4
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index db982fc0f0bc..3b8161c6b601 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* XXX HSD */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
+		preempt_enable();
+	}
+	WRITE_ONCE(*(u64 *)csb, -1);
+
+	ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
+	new_queue =
 		lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE;

 	/*
@@ -3995,6 +4008,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
 	WRITE_ONCE(*execlists->csb_write, reset_value);
 	wmb(); /* Make sure this is visible to HW (paranoia?) */

+	/* Check that the GPU does indeed update the CSB entries! */
+	memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64));
 	invalidate_csb_entries(&execlists->csb_status[0],
 			       &execlists->csb_status[reset_value]);
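Purely for illustration, the retry statistics quoted in the commit message could be gathered with a small helper along these lines; the helper name and the retries counter below are made up for this sketch and are not part of the posted patch:

static u64 csb_wait_ns(const u64 *csb, unsigned int *retries)
{
	u64 start = local_clock();

	*retries = 0;
	/* spin until the GPU overwrites the scrubbed (-1) entry */
	while (READ_ONCE(*csb) == -1) {
		cpu_relax();
		(*retries)++;
	}

	return local_clock() - start;
}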
On 8/14/2020 8:57 AM, Chris Wilson wrote:
On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries") where, presumably, due to a missing Global Observation Point synchronisation, the write pointer of the CSB ringbuffer is updated _prior_ to the contents of the ringbuffer. That is we see the GPU report more context-switch entries for us to parse, but those entries have not been written, leading us to process stale events, and eventually report a hung GPU.
However, this effect appears to be much more severe than we previously saw on Icelake (though it might be best if we try the same approach there as well and measure), and Bruce suggested the good idea of resetting the CSB entry after use so that we can detect when it has been updated by the GPU. By instrumenting how long that may be, we can set a reliable upper bound for how long we should wait for:
513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
Suggested-by: Bruce Chang <yu.bruce.chang@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: stable@vger.kernel.org # v5.4
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index db982fc0f0bc..3b8161c6b601 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* XXX HSD */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
Our tests showed that 10us is not long enough, but 20us worked well, so 50us should be good enough.
+		preempt_enable();
+	}
+	WRITE_ONCE(*(u64 *)csb, -1);
A wmb() is probably needed here. It should be OK if the CSB is in SMEM, but if the CSB is allocated in LMEM the memory type will be WC, so the memory write (WRITE_ONCE) may still be sitting in the write-combine buffer rather than in any cache, i.e., not visible to the HW.
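As a minimal sketch of that suggestion (the helper name is made up and this is not part of the posted patch), the scrub would become:

static inline void scrub_csb_entry(u64 *csb)
{
	WRITE_ONCE(*csb, -1);
	wmb(); /* drain the write-combine buffer when the CSB is in WC LMEM */
}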
+
+	ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
+	new_queue =
 		lower_32_bits(entry) & GEN12_CTX_STATUS_SWITCHED_TO_NEW_QUEUE;

 	/*
@@ -3995,6 +4008,8 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
 	WRITE_ONCE(*execlists->csb_write, reset_value);
 	wmb(); /* Make sure this is visible to HW (paranoia?) */

+	/* Check that the GPU does indeed update the CSB entries! */
+	memset(execlists->csb_status, -1, (reset_value + 1) * sizeof(u64));
 	invalidate_csb_entries(&execlists->csb_status[0],
 			       &execlists->csb_status[reset_value]);
Quoting Chang, Bruce (2020-08-14 19:07:53)
On 8/14/2020 8:57 AM, Chris Wilson wrote:
On Tigerlake, we are seeing a repeat of commit d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries") where, presumably, due to a missing Global Observation Point synchronisation, the write pointer of the CSB ringbuffer is updated _prior_ to the contents of the ringbuffer. That is we see the GPU report more context-switch entries for us to parse, but those entries have not been written, leading us to process stale events, and eventually report a hung GPU.
However, this effect appears to be much more severe than we previously saw on Icelake (though it might be best if we try the same approach there as well and measure), and Bruce suggested the good idea of resetting the CSB entry after use so that we can detect when it has been updated by the GPU. By instrumenting how long that may be, we can set a reliable upper bound for how long we should wait for:
513 late, avg of 61 retries (590 ns), max of 1061 retries (10099 ns)
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
References: d8f505311717 ("drm/i915/icl: Forcibly evict stale csb entries")
Suggested-by: Bruce Chang <yu.bruce.chang@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: stable@vger.kernel.org # v5.4
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index db982fc0f0bc..3b8161c6b601 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* XXX HSD */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
Our tests showed that 10us is not long enough, but 20us worked well, so 50us should be good enough.
+		preempt_enable();
+	}
+	WRITE_ONCE(*(u64 *)csb, -1);
A wmb() is probably needed here. It should be OK if the CSB is in SMEM, but if the CSB is allocated in LMEM the memory type will be WC, so the memory write (WRITE_ONCE) may still be sitting in the write-combine buffer rather than in any cache, i.e., not visible to the HW.
There's a trick here. Before the GPU can wrap the CSB ringbuffer, there must be a register write that itself will flush the WCB. Not only that, we will have an explicit wmb() prior to that register write. Sneaky.
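Sketching the ordering described above with illustrative names only (these are not the actual i915 functions or registers):

static void scrub_csb_entry(u64 *csb)
{
	WRITE_ONCE(*csb, -1);	/* may linger in the write-combine buffer */
}

static void kick_hw(void __iomem *reg, u32 val)
{
	wmb();			/* orders the scrub ahead of the mmio write */
	writel(val, reg);	/* uncached register write, flushes the WCB */
}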
We probably want to avoid putting the HWSP into WC and fix whatever cache snooping is required. At least to the point of being able to measure the impact as we read from HWSP (and "HWSP") frequently. -Chris
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* XXX HSD */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
Our tests showed that 10us is not long enough, but 20us worked well, so 50us should be good enough.
Just realized this may not fully work, as one of the common issues we run into is that the upper 32 bits are updated by the HW but the lower 32 bits are updated at a later time, meaning the CSB can read like 0xFFFFFFFF:xxxxxxxx (low:high). So this check (!= -1) can still pass with a partially invalid CSB status, and we may need to check each 32 bits separately.
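A rough sketch of that per-dword check (illustrative only, not the posted patch):

static inline bool csb_entry_updated(const u64 *csb)
{
	u64 entry = READ_ONCE(*csb);

	/* both halves must have left the -1 scrub value */
	return lower_32_bits(entry) != ~0u && upper_32_bits(entry) != ~0u;
}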
+		preempt_enable();
+	}
+	WRITE_ONCE(*(u64 *)csb, -1);
A wmb() is probably needed here. It should be OK if the CSB is in SMEM, but if the CSB is allocated in LMEM the memory type will be WC, so the memory write (WRITE_ONCE) may still be sitting in the write-combine buffer rather than in any cache, i.e., not visible to the HW.
On 8/14/2020 5:36 PM, Chang, Bruce wrote:
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* XXX HSD */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
Our tests showed that 10us is not long enough, but 20us worked well, so 50us should be good enough.
Just realized this may not fully work, as one of the common issues we run into is that the upper 32 bits are updated by the HW but the lower 32 bits are updated at a later time, meaning the CSB can read like 0xFFFFFFFF:xxxxxxxx (low:high). So this check (!= -1) can still pass with a partially invalid CSB status, and we may need to check each 32 bits separately.
After testing with the new 64-bit read, the above issue has not happened so far, so it seems it only applies to 32-bit reads (the CSB being updated between the lower and upper 32-bit reads). Assuming the HW 64-bit CSB update is also atomic, the above code should be fine.
+		preempt_enable();
+	}
+	WRITE_ONCE(*(u64 *)csb, -1);
A wmb() is probably needed here. It should be OK if the CSB is in SMEM, but if the CSB is allocated in LMEM the memory type will be WC, so the memory write (WRITE_ONCE) may still be sitting in the write-combine buffer rather than in any cache, i.e., not visible to the HW.
Quoting Chang, Bruce (2020-08-15 03:16:58)
On 8/14/2020 5:36 PM, Chang, Bruce wrote:
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* XXX HSD */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
Our tests showed that 10us is not long enough, but 20us worked well, so 50us should be good enough.
Just realized this may not fully work, as one of the common issues we run into is that the upper 32 bits are updated by the HW but the lower 32 bits are updated at a later time, meaning the CSB can read like 0xFFFFFFFF:xxxxxxxx (low:high). So this check (!= -1) can still pass with a partially invalid CSB status, and we may need to check each 32 bits separately.
After testing with the new 64-bit read, the above issue has not happened so far, so it seems it only applies to 32-bit reads (the CSB being updated between the lower and upper 32-bit reads). Assuming the HW 64-bit CSB update is also atomic, the above code should be fine.
Fortunately for all the platforms we care about here, READ_ONCE(u64) will be a single 64b read and so both lower/upper dwords will be pulled from the same bus transfer. We really need a compiler warning for when READ_ONCE() is not a singular atomic operation. atomic64_t has too much connotation with cross-core atomicity for my liking when dealing with [cacheable] mmio semantics. -Chris
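For illustration (the helper names below are made up, not from the patch), the difference between a split read and the single 64-bit READ_ONCE() looks like:

static inline u64 csb_read_split(const u32 *csb)
{
	u32 lo = READ_ONCE(csb[0]);	/* the HW may update the entry... */
	u32 hi = READ_ONCE(csb[1]);	/* ...between these two loads */

	return (u64)hi << 32 | lo;
}

static inline u64 csb_read_whole(const u64 *csb)
{
	return READ_ONCE(*csb);		/* single 64b load on 64b builds */
}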
Quoting Chang, Bruce (2020-08-15 01:36:10)
@@ -2498,9 +2498,22 @@ invalidate_csb_entries(const u64 *first, const u64 *last)
  */
 static inline bool gen12_csb_parse(const u64 *csb)
 {
-	u64 entry = READ_ONCE(*csb);
-	bool ctx_away_valid = GEN12_CSB_CTX_VALID(upper_32_bits(entry));
-	bool new_queue =
+	bool ctx_away_valid;
+	bool new_queue;
+	u64 entry;
+
+	/* XXX HSD */
+	entry = READ_ONCE(*csb);
+	if (unlikely(entry == -1)) {
+		preempt_disable();
+		if (wait_for_atomic_us((entry = READ_ONCE(*csb)) != -1, 50))
+			GEM_WARN_ON("50us CSB timeout");
Our tests showed that 10us is not long enough, but 20us worked well, so 50us should be good enough.
Just realized this may not fully work, as one of the common issues we run into is that the upper 32 bits are updated by the HW but the lower 32 bits are updated at a later time, meaning the CSB can read like 0xFFFFFFFF:xxxxxxxx (low:high). So this check (!= -1) can still pass with a partially invalid CSB status, and we may need to check each 32 bits separately.
Hence the transformation to use u64 as the entry type :) -Chris