[PATCH] arm64/mm: Add memory barrier for mm_cid

levi.yun

5 Mar 2024

2:53 p.m.

Currently arm64's switch_mm() doesn't always have an smp_mb() which the core scheduler code has depended upon since commit:

commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")

If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can unset an actively used cid when it fails to observe the active task after it sets lazy_put.

Add an smp_mb() to arm64's check_and_switch_context() to guarantee that the active task is observed after sched_mm_cid_remote_clear() succeeds in setting lazy_put.

Signed-off-by: levi.yun <yeoreum.yun@arm.com>
Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Cc: stable@vger.kernel.org # 6.4.x
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
---
I'm really sorry if you got this multiple times.
I had some problems with the SMTP server...

 arch/arm64/mm/context.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 188197590fc9..7a9e8e6647a0 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -268,6 +268,11 @@ void check_and_switch_context(struct mm_struct *mm)
 	 */
 	if (!system_uses_ttbr0_pan())
 		cpu_switch_mm(mm->pgd, mm);
+
+	/*
+	 * See the comments on switch_mm_cid describing user -> user transition.
+	 */
+	smp_mb();
 }

 unsigned long arm64_mm_context_get(struct mm_struct *mm)
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}


Will Deacon

5 Mar

5:13 p.m.

On Tue, Mar 05, 2024 at 02:53:35PM +0000, levi.yun wrote:

...


We already have a stronger barrier than smp_mb() (dsb ish) in __switch_to(). Is that not sufficient?

Will

levi.yun

6:52 p.m.

Hi Will.

...

We already have a stronger barrier than smp_mb() (dsb ish) in __switch_to(). Is that not sufficient?

IIUC, the barrier in __switch_to() is not sufficient, because the following interleaving with sched_mm_cid_remote_clear() is still possible:

CPU0 in __schedule()              CPU1 in sched_mm_cid_remote_clear()

rq->curr = new_task;
<no barrier>
mm_get_cid                        remote_clear
 - check valid cid and use it.     Invalidate CID.
                                   <barrier>
                                   rq->curr (not observed).
                                   unset the cid (<<BUG).

If the change to rq->curr cannot be observed in sched_mm_cid_remote_clear(), it can unset an actively used cid. Note that the barrier in __switch_to() is executed AFTER switch_mm_cid(). That means that, before __switch_to() runs, sched_mm_cid_remote_clear() may fail to observe the new active task after it sets lazy_put on the cid that the new active task is using.
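The interleaving described above can be replayed in a small, single-threaded C model. This is only a sketch under simplifying assumptions, not kernel code: CPU0's store to rq->curr is modelled as sitting in a store buffer until a full barrier drains it, and every identifier (struct model, drain_store_buffer(), cid_cleared_while_in_use()) is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: every name below is hypothetical. */

enum task { OLD_TASK, NEW_TASK };

struct model {
	enum task rq_curr_mem;      /* rq->curr as visible to other CPUs */
	enum task rq_curr_buffered; /* CPU0's store still in its store buffer */
	bool store_pending;         /* true while that store is not yet visible */
	bool cid_lazy_put;          /* CPU1 flagged the cid for lazy release */
	bool cid_in_use_by_cpu0;    /* CPU0 picked up the per-CPU cid */
	bool cid_unset;             /* CPU1 released the cid */
};

/* Make CPU0's buffered store globally visible (what a full barrier forces). */
static void drain_store_buffer(struct model *m)
{
	if (m->store_pending) {
		m->rq_curr_mem = m->rq_curr_buffered;
		m->store_pending = false;
	}
}

/*
 * Replay one fixed interleaving of the two CPUs.
 * Returns true when CPU1 unsets a cid that CPU0 is actively using.
 */
static bool cid_cleared_while_in_use(bool barrier_after_rq_curr_store)
{
	struct model m = { .rq_curr_mem = OLD_TASK };

	/* CPU0: rq->curr = new_task; the store may linger in the store buffer. */
	m.rq_curr_buffered = NEW_TASK;
	m.store_pending = true;

	/* CPU0: optional full barrier before looking at the cid. */
	if (barrier_after_rq_curr_store)
		drain_store_buffer(&m);

	/* CPU0: the cid is valid and not flagged lazy_put, so use it. */
	if (!m.cid_lazy_put)
		m.cid_in_use_by_cpu0 = true;

	/* CPU1: flag the cid with lazy_put. */
	m.cid_lazy_put = true;

	/* CPU1: read rq->curr from memory; a buffered store is invisible here. */
	if (m.rq_curr_mem != NEW_TASK)
		m.cid_unset = true;

	/* CPU0's store eventually drains, too late for CPU1's check. */
	drain_store_buffer(&m);
	assert(m.rq_curr_mem == NEW_TASK);

	return m.cid_in_use_by_cpu0 && m.cid_unset;
}
```

Calling cid_cleared_while_in_use(false) reproduces the bug in this model (the cid is unset while in use), while cid_cleared_while_in_use(true) lets CPU1 observe the new task and leave the cid alone.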

Mathieu Desnoyers

8:01 p.m.

On 2024-03-05 09:53, levi.yun wrote:

...


This comment from the original implementation of membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED states that the original need from membarrier was to have a full barrier between storing to rq->curr and return to userspace:

commit 22e4ebb9758 ("membarrier: Provide expedited private command")

commit message:

* Our TSO archs can do RELEASE without being a full barrier. Look at x86 spin_unlock() being a regular STORE for example. But for those archs, all atomics imply smp_mb and all of them have atomic ops in switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full barrier.

* From all weakly ordered machines, only ARM64 and PPC can do RELEASE, the rest does indeed do smp_mb(), so there the spin_unlock() is a full barrier and we're good.

* ARM64 has a very heavy barrier in switch_to(), which suffices.

* PPC just removed its barrier from switch_to(), but appears to be talking about adding something to switch_mm(). So add a smp_mb__after_unlock_lock() for now, until this is settled on the PPC side.

associated code:

+		/*
+		 * The membarrier system call requires each architecture
+		 * to have a full memory barrier after updating
+		 * rq->curr, before returning to user-space. For TSO
+		 * (e.g. x86), the architecture must provide its own
+		 * barrier in switch_mm(). For weakly ordered machines
+		 * for which spin_unlock() acts as a full memory
+		 * barrier, finish_lock_switch() in common code takes
+		 * care of this barrier. For weakly ordered machines for
+		 * which spin_unlock() acts as a RELEASE barrier (only
+		 * arm64 and PowerPC), arm64 has a full barrier in
+		 * switch_to(), and PowerPC has
+		 * smp_mb__after_unlock_lock() before
+		 * finish_lock_switch().
+		 */

Which got updated to this by

commit 306e060435d ("membarrier: Document scheduler barrier requirements")

 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
+		 * rq->curr, before returning to user-space.
+		 *
+		 * Here are the schemes providing that barrier on the
+		 * various architectures:
+		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
+		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
+		 * - finish_lock_switch() for weakly-ordered
+		 *   architectures where spin_unlock is a full barrier,
+		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
+		 *   is a RELEASE barrier),
 		 */

However, rseq mm_cid has stricter requirements: the barrier needs to be issued between store to rq->curr and switch_mm_cid(), which happens earlier than:

- spin_unlock(),
- switch_to().

So it's fine when the architecture switch_mm happens to have that barrier already, but less so when the architecture only provides the full barrier in switch_to() or spin_unlock().

The issue is therefore not specific to arm64, it's actually a bug in the rseq switch_mm_cid() implementation. All architectures that don't have memory barriers in switch_mm(), but rather have the full barrier either in finish_lock_switch() or switch_to() have them too late for the needs of switch_mm_cid().
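One way to picture the difference between the two requirements is to list the context-switch steps in order and ask where the first full barrier after the rq->curr store falls. The following C sketch does that. The step list is simplified, and the arch_barriers descriptor, the two example entries and the helper names are assumptions for illustration (the arm64-like entry models the case where switch_mm() provides no full barrier).

```c
#include <assert.h>
#include <stdbool.h>

/* Context-switch steps in execution order (simplified model). */
enum step {
	STORE_RQ_CURR,
	SWITCH_MM,
	SWITCH_MM_CID,
	SWITCH_TO,
	FINISH_LOCK_SWITCH,
	RETURN_TO_USER,
	NR_STEPS
};

/* Hypothetical descriptor: which steps contain a full barrier on an arch. */
struct arch_barriers {
	bool full_barrier[NR_STEPS];
};

/* Example entries; real architectures are more nuanced than this. */
static const struct arch_barriers x86_like = {
	.full_barrier = { [SWITCH_MM] = true },
};
static const struct arch_barriers arm64_like = {
	.full_barrier = { [SWITCH_TO] = true },
};

/* True if some step strictly between `from` and `to` has a full barrier. */
static bool barrier_between(const struct arch_barriers *a,
			    enum step from, enum step to)
{
	assert(from < to && to < NR_STEPS);
	for (int s = (int)from + 1; s < (int)to; s++)
		if (a->full_barrier[s])
			return true;
	return false;
}

/* membarrier: barrier after the rq->curr store, before return to user. */
static bool membarrier_ok(const struct arch_barriers *a)
{
	return barrier_between(a, STORE_RQ_CURR, RETURN_TO_USER);
}

/* mm_cid: barrier after the rq->curr store, before switch_mm_cid(). */
static bool mm_cid_ok(const struct arch_barriers *a)
{
	return barrier_between(a, STORE_RQ_CURR, SWITCH_MM_CID);
}
```

In this model both example entries satisfy membarrier_ok(), but only the x86-like entry satisfies mm_cid_ok(), because a barrier in switch_to() comes after switch_mm_cid().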

I would recommend one of three approaches here:

A) Add smp_mb() in switch_mm_cid() for all architectures that lack that barrier in switch_mm().

B) Figure out if we can move switch_mm_cid() further down in the scheduler without breaking anything (within switch_to(), at the very end of finish_lock_switch() for instance). I'm not sure we can do that though because switch_mm_cid() touches the "prev" which is tricky after switch_to().

C) Add barriers in switch_mm() within all architectures that are missing it.
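Approach A could look roughly like the following, modelled in user-space C11 rather than kernel code. ARCH_SWITCH_MM_HAS_FULL_BARRIER, struct percpu_cid, mm_cid_order_after_rq_curr_store() and switch_mm_cid_model() are hypothetical names, and atomic_thread_fence(memory_order_seq_cst) merely stands in for smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical per-architecture knob: set to 1 where switch_mm() already
 * implies a full barrier, so the extra fence can be skipped.
 */
#ifndef ARCH_SWITCH_MM_HAS_FULL_BARRIER
#define ARCH_SWITCH_MM_HAS_FULL_BARRIER 0
#endif

/* Stand-in for the per-CPU cid slot; the layout is an assumption. */
struct percpu_cid {
	atomic_int cid;
};

/* Demo slot holding an arbitrary cid value. */
static struct percpu_cid demo_pcpu = { 7 };

/*
 * Issue a full fence unless the architecture already provides one.
 * Returns true when the fence was actually issued here.
 */
static inline bool mm_cid_order_after_rq_curr_store(void)
{
#if ARCH_SWITCH_MM_HAS_FULL_BARRIER
	return false;	/* switch_mm() already ordered the store */
#else
	atomic_thread_fence(memory_order_seq_cst);	/* plays the role of smp_mb() */
	return true;
#endif
}

/* Model of the cid pickup: order against the rq->curr store, then load. */
static int switch_mm_cid_model(struct percpu_cid *pcpu)
{
	assert(pcpu != NULL);
	mm_cid_order_after_rq_curr_store();
	return atomic_load_explicit(&pcpu->cid, memory_order_relaxed);
}
```

The point of the compile-time knob is that architectures whose switch_mm() already contains a full barrier would pay nothing extra, while the others get the ordering exactly where switch_mm_cid() needs it.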

Thoughts ?

Thanks,

Mathieu


-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com

levi.yun

9:07 p.m.

Hi Mathieu!

On 05/03/2024 20:01, Mathieu Desnoyers wrote:

...


Thanks for the detailed explanation!

...


IMHO, A) looks good to me.

In the case of B), if you assume that spin_unlock() on rq->lock acts as a full memory barrier, I'm not sure that holds for architectures that use queued_spin_unlock().

Looking at the implementation of queued_spin_unlock(), it uses smp_store_release(). But according to the MULTICOPY ATOMICITY section of memory-barriers.txt, if smp_mb__after_atomic() is implemented with smp_mb(), the store might still fail to be observed.

Am I wrong?

Many thanks!


linux-stable-mirror@lists.linaro.org
