On Thu, 28 Jul 2022 08:32:32 +0200 Mauro Carvalho Chehab mauro.chehab@linux.intel.com wrote:
On Wed, 27 Jul 2022 13:56:50 +0100 Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
Because vma_invalidate_tlb() basically just stores a TLB seqno; the actual invalidation is deferred to the point where the pages are unset, at __i915_gem_object_unset_pages().
So, what happens is:
- in VMA sync mode, the need to invalidate the TLB is marked at __vma_put_pages(), before the VMA unbind;
- in async mode, this is deferred to ppgtt_unbind_vma(), which marks the need to invalidate the TLBs.
In both cases, __i915_gem_object_unset_pages() is called later, when the driver is ready to unmap the pages.
Sorry, it's still not clear to me why the patch moves the marking of the need to invalidate (regardless of whether it's a bit like today, or a seqno like in this patch) from bind to unbind?
What if the seqno were stored in i915_vma_bind(), where the bit is set today, so that all the hunks touching unbind and evict would disappear from the patch? What wouldn't work in that case, if anything?
Ah, now I see your point.
I can't see any sense in storing a sequence number at VMA bind, as the unbind order can be different. Whether a full TLB invalidation is needed or not depends on the unbind order.
The way the current algorithm works is that drm_i915_gem_object instances can be created in any order and, at unbind/evict time, each of them receives a seqno.
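Roughly speaking, the stamping at unbind/evict boils down to something like the sketch below (simplified; the exact vma_invalidate_tlb() signature and call site may differ in the patch):

void vma_invalidate_tlb(struct i915_address_space *vm, u32 *tlb)
{
	/*
	 * Just record the seqno of the next full TLB invalidation; the
	 * actual flush is deferred to __i915_gem_object_unset_pages(),
	 * which skips it if a later full flush already covered this
	 * object.
	 */
	WRITE_ONCE(*tlb, intel_gt_next_invalidate_tlb_full(vm->gt));
}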
The seqno is incremented at intel_gt_invalidate_tlb():
void intel_gt_invalidate_tlb(struct intel_gt *gt, u32 seqno)
{
	...
	with_intel_gt_pm_if_awake(gt, wakeref) {
		mutex_lock(&gt->tlb.invalidate_lock);
		if (tlb_seqno_passed(gt, seqno))
			goto unlock;

		mmio_invalidate_full(gt);

		write_seqcount_invalidate(&gt->tlb.seqno); // increment seqno
	...
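For reference, the helper that decides whether a past full flush already covers a given stamp is tlb_seqno_passed(); it is roughly the following (simplified here, see the patch for the exact comparison):

static bool tlb_seqno_passed(const struct intel_gt *gt, u32 seqno)
{
	const u32 cur = intel_gt_tlb_seqno(gt);

	/* Only skip if a *full* TLB invalidate barrier has passed */
	return (s32)(cur - ALIGN(seqno, 2)) > 0;
}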
So, let's say 3 objects were created, in this order:
	obj1
	obj2
	obj3
They could be unbound/evicted in a different order. At that time, the object's mm.tlb will be stamped with a seqno, using the number from the last TLB flush, plus 1.
As different threads can be used to handle TLB flushes, let's imagine two threads (just for the sake of having an example). In such a case, what we would have is:
  seqno     Thread 0                            Thread 1

  seqno=2
            unbind/evict event
            obj3.mm.tlb = seqno | 1

            unbind/evict event
            obj1.mm.tlb = seqno | 1

                                                __i915_gem_object_unset_pages()
                                                called for obj3, TLB flush happened,
                                                invalidating both obj1 and obj2.
                                                seqno += 2
  seqno=4
            unbind/evict event
            obj2.mm.tlb = seqno | 1

            __i915_gem_object_unset_pages()
            called for obj1, don't flush,
            as the past flush already invalidated obj1

                                                __i915_gem_object_unset_pages()
                                                called for obj2, TLB flush happened
                                                seqno += 2
  seqno=6
So, basically, the seqno is used to track when the object's data stopped being updated because of an unbind/evict event. It is later used by intel_gt_invalidate_tlb(), when called from __i915_gem_object_unset_pages(), to check whether a previous invalidation call was already enough to invalidate the object, or whether a new call is needed.
Now, if the seqno were stored at bind time, data could still leak, as the assumption made by intel_gt_invalidate_tlb() that the data stopped being used at that seqno would no longer hold.
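To make that concrete, here is a tiny standalone illustration; seqno_passed() and ALIGN2() below are local stand-ins modeled on the driver's tlb_seqno_passed()/ALIGN(), and the numbers are made up:

#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's ALIGN(x, 2): round up to even. */
#define ALIGN2(x) (((x) + 1u) & ~1u)

/* Local stand-in modeled on the driver's skip test. */
static int seqno_passed(uint32_t cur, uint32_t seqno)
{
	return (int32_t)(cur - ALIGN2(seqno)) > 0;
}

int main(void)
{
	/* Hypothetical: seqno sampled at *bind* time, while the GT seqno was 2. */
	uint32_t obj_tlb = 2 | 1;

	/*
	 * Two unrelated full invalidations happen while the object is still
	 * bound and its pages are still in use, so the TLB may have been
	 * re-populated after the last flush; the GT seqno reaches 6.
	 */
	uint32_t gt_seqno = 6;

	/*
	 * At __i915_gem_object_unset_pages() time the flush would now be
	 * skipped, leaving stale TLB entries behind: that is the leak.
	 */
	assert(seqno_passed(gt_seqno, obj_tlb));
	return 0;
}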
Still, I agree that this logic is complex and should be better documented. So, if you're now OK with this patch, I'll add the above explanation inside a kernel-doc comment.
I'm enclosing the kernel-doc patch (to be applied after moving the code into its own files: intel_tlb.c/intel_tlb.h):
[PATCH] drm/i915/gt: document TLB cache invalidation functions
Add a description for the TLB cache invalidation algorithm and for the related kAPI functions.
Signed-off-by: Mauro Carvalho Chehab mchehab@kernel.org
diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.c b/drivers/gpu/drm/i915/gt/intel_tlb.c
index af8cae979489..8eda0743da74 100644
--- a/drivers/gpu/drm/i915/gt/intel_tlb.c
+++ b/drivers/gpu/drm/i915/gt/intel_tlb.c
@@ -145,6 +145,18 @@ static void mmio_invalidate_full(struct intel_gt *gt)
 	intel_uncore_forcewake_put_delayed(uncore, FORCEWAKE_ALL);
 }
+/**
+ * intel_gt_invalidate_tlb_full - do full TLB cache invalidation
+ * @gt: GT structure
+ * @seqno: sequence number
+ *
+ * Do a full TLB cache invalidation if the @seqno is bigger than the last
+ * full TLB cache invalidation.
+ *
+ * Note:
+ * The TLB cache invalidation logic depends on GEN-specific registers.
+ * It currently supports GEN8 to GEN12 and GuC-based TLB cache invalidation.
+ */
 void intel_gt_invalidate_tlb_full(struct intel_gt *gt, u32 seqno)
 {
 	intel_wakeref_t wakeref;
@@ -177,6 +189,12 @@ void intel_gt_init_tlb(struct intel_gt *gt)
 	seqcount_mutex_init(&gt->tlb.seqno, &gt->tlb.invalidate_lock);
 }
+/**
+ * intel_gt_fini_tlb - free TLB-specific vars
+ * @gt: GT structure
+ *
+ * Frees any resources needed by TLB cache invalidation logic.
+ */
 void intel_gt_fini_tlb(struct intel_gt *gt)
 {
 	mutex_destroy(&gt->tlb.invalidate_lock);
diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.h b/drivers/gpu/drm/i915/gt/intel_tlb.h
index 46ce25bf5afe..d186f5d5901f 100644
--- a/drivers/gpu/drm/i915/gt/intel_tlb.h
+++ b/drivers/gpu/drm/i915/gt/intel_tlb.h
@@ -11,16 +11,99 @@
#include "intel_gt_types.h"
+/**
+ * DOC: TLB cache invalidation logic
+ *
+ * The way the current algorithm works is that a drm_i915_gem_object can be
+ * created in any order. At unbind/evict time, the object is guaranteed not
+ * to be used anymore, so it stores a sequence number provided by
+ * intel_gt_next_invalidate_tlb_full(). This can happen either at
+ * __vma_put_pages(), for VMA sync unbind, or at ppgtt_unbind_vma(), for
+ * VMA async unbind.
+ *
+ * At __i915_gem_object_unset_pages(), intel_gt_invalidate_tlb() is called,
+ * where it checks whether the sequence number stored in the object was
+ * already invalidated or not. If not, it does a full TLB invalidation and
+ * increments the sequence number::
+ *
+ *   void intel_gt_invalidate_tlb(struct intel_gt *gt, u32 seqno)
+ *   {
+ *           ...
+ *           with_intel_gt_pm_if_awake(gt, wakeref) {
+ *                   mutex_lock(&gt->tlb.invalidate_lock);
+ *                   if (tlb_seqno_passed(gt, seqno))
+ *                           goto unlock;
+ *
+ *                   mmio_invalidate_full(gt);
+ *
+ *                   write_seqcount_invalidate(&gt->tlb.seqno); // increment seqno
+ *           ...
+ *
+ * So, let's say the current seqno is 2 and 3 new objects were created,
+ * in this order:
+ *
+ *   obj1
+ *   obj2
+ *   obj3
+ *
+ * They can be unbound/evicted in a different order. At unbind/evict time,
+ * the mm.tlb will be stamped with the sequence number, using the number
+ * from the last TLB flush, plus 1.
+ *
+ * Different threads may be used for unbind/evict and/or unset pages.
+ *
+ * As the logic at intel_gt_invalidate_tlb() is protected by a mutex,
+ * for simplicity, let's consider just two threads::
+ *
+ *   sequence number   Thread 0                     Thread 1
+ *
+ *   seqno=2
+ *                     unbind/evict event
+ *                     obj3.mm.tlb = seqno | 1
+ *
+ *                     unbind/evict event
+ *                     obj1.mm.tlb = seqno | 1
+ *
+ *                                                  __i915_gem_object_unset_pages()
+ *                                                  called for obj3 => TLB flush
+ *                                                  invalidating both obj1 and obj2.
+ *                                                  seqno += 2
+ *   seqno=4
+ *                     unbind/evict event
+ *                     obj2.mm.tlb = seqno | 1
+ *
+ *                     __i915_gem_object_unset_pages()
+ *                     called for obj1, don't flush,
+ *                     as past flush invalidated obj1
+ *
+ *                                                  __i915_gem_object_unset_pages()
+ *                                                  called for obj2 => TLB flush
+ *                                                  seqno += 2
+ *   seqno=6
+ */
+
 void intel_gt_invalidate_tlb_full(struct intel_gt *gt, u32 seqno);
 void intel_gt_init_tlb(struct intel_gt *gt);
 void intel_gt_fini_tlb(struct intel_gt *gt);
+/**
+ * intel_gt_tlb_seqno - Returns the current TLB invalidation sequence number
+ * @gt: GT structure
+ *
+ * There's no need to lock while calling it, as seqprop_sequence is thread-safe
+ */
 static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
 {
 	return seqprop_sequence(&gt->tlb.seqno);
 }
+/**
+ * intel_gt_next_invalidate_tlb_full - Returns the next TLB full invalidation
+ *	sequence number
+ * @gt: GT structure
+ *
+ * There's no need to lock while calling it, as seqprop_sequence is thread-safe
+ */
 static inline u32 intel_gt_next_invalidate_tlb_full(const struct intel_gt *gt)
 {
 	return intel_gt_tlb_seqno(gt) | 1;
 }