On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin npiggin@gmail.com wrote:
Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:
On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote:
On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin linux@armlinux.org.uk wrote:
On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
After chatting with rmk about this (but without claiming that any of this is his opinion), based on the manpage, I think membarrier() currently doesn't really claim to be synchronizing caches? It just serializes cores. So arguably if userspace wants to use membarrier() to synchronize code changes, userspace should first do the code change, then flush icache as appropriate for the architecture, and then do the membarrier() to ensure that the old code is unused?
^ exactly, yes.
For 32-bit arm, rmk pointed out that that would be the cacheflush() syscall. That might cause you to end up with two IPIs instead of one in total, but we probably don't care _that_ much about extra IPIs on 32-bit arm?
This was the original thinking, yes. The cacheflush IPI will flush specific regions of code, and the membarrier IPI issues context synchronizing instructions.
APIs should be written in terms of the service they provide to userspace, and in the highest-level terms possible, rather than directing hardware to do some low level operation. Unfortunately we're stuck with this for now. We could deprecate it and replace it though.
If userspace wants to modify code and ensure that, once the system call returns, no other thread will be executing the previous code, then there should be an API for that. It could actually combine the two IPIs into one for architectures that require both too.
I agree. The membarrier API for SYNC_CORE is pretty nasty. I would much prefer a real API for JITs to use.
Architectures with coherent i/d caches don't need the cacheflush step.
There are different levels of coherency -- VIVT architectures may have differing requirements compared to PIPT, etc.
In any case, I feel like the approach taken by the documentation is fundamentally confusing. Architectures don't all speak the same language. How about something like:
The SYNC_CORE operation causes all threads in the caller's address space (including the caller) to execute an architecture-defined barrier operation. membarrier() will ensure that this barrier is executed at a time such that all data writes done by the calling thread before membarrier() are made visible by the barrier. Additional architecture-dependent cache management operations may be required to use this for JIT code.
As I said, this isn't what SYNC_CORE does, and it's not what powerpc context synchronizing instructions do either; powerpc will very much re-order visibility of stores around such an instruction.
Perhaps the docs should be entirely arch-specific. It may well be impossible to state what it does in an arch-neutral way.
A thread completes store instructions into a store queue, which is as far as a context synchronizing event goes. Visibility comes at some indeterminate time later.
As currently implemented, it has the same visibility semantics as regular membarrier, too. So if I do:
    a = 1;
    membarrier(SYNC_CORE);
    b = 1;
and another thread does:
    while (READ_ONCE(b) != 1)
        ;
    barrier();
    assert(a == 1);
then the assertion will pass. Similarly, one can do this, I hope:
    memcpy(codeptr, [some new instructions], len);
    arch_dependent_cache_flush(codeptr, len);
    membarrier(SYNC_CORE);
    ready = 1;
and another thread does:
    while (READ_ONCE(ready) != 1)
        ;
    barrier();
    (*codeptr)();
arch_dependent_cache_flush is a nop on x86. On arm and arm64, it appears to be a syscall, although maybe arm64 can do it from userspace. I still don't know what it is on powerpc.
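For concreteness, the writer side of that sequence could be sketched in C roughly as follows. This is only a sketch under assumptions: sys_membarrier, jit_sync_core_register and jit_publish are made-up names; __builtin___clear_cache (a GCC/Clang builtin) stands in for arch_dependent_cache_flush and may not be the right operation on every architecture; and SYNC_CORE is spelled as the existing MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE command, which needs a one-time registration per process.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <assert.h>
#include <linux/membarrier.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raw syscall wrapper (hypothetical name); done by hand here rather than
 * relying on a libc wrapper. */
static int sys_membarrier(int cmd, unsigned int flags, int cpu_id)
{
    return (int)syscall(__NR_membarrier, cmd, flags, cpu_id);
}

/* One-time registration, required before the expedited SYNC_CORE
 * command may be used. Returns 0, or -1 with errno set. */
static int jit_sync_core_register(void)
{
    return sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE,
                          0, 0);
}

/* Assumption: the compiler builtin is an acceptable stand-in for the
 * architecture-specific flush. It is expected to be a no-op on x86. */
static void arch_dependent_cache_flush(void *addr, size_t len)
{
    __builtin___clear_cache((char *)addr, (char *)addr + len);
}

/* Writer side: copy the new instructions, do the arch-specific flush,
 * run the SYNC_CORE barrier on all threads, then set the ready flag.
 * Returns 0 on success; on -1 the flag is left untouched. */
static int jit_publish(void *codeptr, const void *insns, size_t len,
                       int *ready)
{
    assert(codeptr != NULL && insns != NULL && ready != NULL);

    memcpy(codeptr, insns, len);
    arch_dependent_cache_flush(codeptr, len);
    if (sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0) != 0)
        return -1;
    __atomic_store_n(ready, 1, __ATOMIC_RELAXED);   /* like WRITE_ONCE() */
    return 0;
}
```

A process would call jit_sync_core_register() once at startup. If that call or jit_publish() returns -1, the kernel or architecture does not offer the command and the JIT needs some other strategy.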
Even using the term "cache" here is misleading. x86 chips have all kinds of barely-documented instruction caches, and they have varying degrees of coherency. The architecture actually promises that, if you do a certain incantation, then you get the desired result. membarrier() allows a user to do this incantation. But trying to replicate the incantation verbatim on an architecture like ARM is insufficient, and trying to flush the things that are documented as being caches on x86 is expensive and a complete waste of time on x86. When you write some JIT code, you do *not* want to flush it all the way to main memory, especially on CPUs that don't have the ability to write back without invalidating. (That's most CPUs.)
Even on x86, I suspect that the various decoded insn caches are rather more coherent than documented, and I have questions in to Intel about this. No answers yet.
So perhaps the right approach is to say that membarrier() helps you perform the architecture-specific sequence of steps needed to safely modify code. On x86, you use it like this. On arm64, you do this other thing. On powerpc, you do something else.
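If the sequence ends up being architecture-specific, a JIT would presumably first want to know whether the running kernel offers SYNC_CORE at all. A minimal sketch of such a probe, using the MEMBARRIER_CMD_QUERY command (which returns a bitmask of supported commands); the helper names mask_has_sync_core and sync_core_supported are made up for illustration:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <assert.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Pure helper: does a QUERY bitmask contain both the SYNC_CORE command
 * and its registration command? Returns 1 or 0. */
static int mask_has_sync_core(long mask)
{
    const long need = MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE |
                      MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE;

    assert(mask >= 0);          /* callers must filter out syscall errors */
    return (mask & need) == need;
}

/* Returns 1 if SYNC_CORE is offered, 0 if it is not, and -1 if the
 * membarrier syscall itself is unavailable (errno is set). */
static int sync_core_supported(void)
{
    long mask = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0, 0);

    if (mask < 0)
        return -1;
    return mask_has_sync_core(mask);
}
```

This only answers whether the command exists on this kernel and architecture; it says nothing about which additional cache maintenance steps the architecture still requires.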
I would be surprised if x86's serializing instructions were different from powerpc's. rdtsc ordering or flushing stores to cache would be surprising.
At the very least, x86 has several levels of what ARM might call "context synchronization" AFAICT. STAC, CLAC, and POPF do a form of context synchronization in that the changes they cause to the MMU take effect immediately, but they are not documented as synchronizing the instruction stream. "Serializing" instructions do all kinds of things, not all of which may be architecturally visible at all. MFENCE and LFENCE do various complicated things, and LFENCE has magic retroactive capabilities on old CPUs that were not documented when those CPUs were released. SFENCE does a different form of synchronization entirely. LOCK does something else. (The relationship between LOCK and MFENCE is confusing at best.) RDTSC doesn't serialize anything at all, but RDTSCP does provide a form of serialization that's kind of like LFENCE. It's a mess. Even the manuals are inconsistent about what "serialize" means. JMP has its own magic on x86, but only on very very old CPUs.
The specific instruction that flushes everything into the coherency domain is SFENCE, and SFENCE is not, for normal purposes, needed for self- or cross-modifying code.