On Saturday 18 June 2011, Nicolas Pitre wrote:
On Fri, 17 Jun 2011, Arnd Bergmann wrote:
I played with this a bit.
It looks like much more than a bit, thanks for the detailed analysis!
I think that GCC might have improved on this front. It at least doesn't seem to produce the much suboptimal code compared to the plain C implementation as it used to do in the past. See my explanation and experiments below.
If everything else fails, we can make it depend on the gcc version and e.g. use the inline assembly with 4.5 and higher, while we use the C code that used to be fine with older versions.
The function in question is
#define __raw_writel(v,a) (*(volatile unsigned int __force *)(a) = (v))
AFAICT, this should really be written as
static inline void __raw_writel(unsigned int v, volatile __iomem unsigned int *a) { asm volatile("str %0, %1" : "=o" (*a) : "r" (v) : "memory"); }
What I don't know is whether this form is ideal in all cases (or even correct), and what the constraints should be.
The problem here is that the assembly constraint doesn't let us specify which type of access this constraint is for.
Let's take the str instruction as an example. It may use many different addressing modes:
str r0, [r1] # destination address in r1 str r0, [r1, r2] # base address in r1 with offset in r2 str r0, [r1, #<i>] # base address in r1 with an immediate offset
I believe that they are all part of the "o" constraint, which includes all offsettable memory addresses.
The result is the same as gcc is smart enough to share the common iobase and spread the immediate offsets. This also provides better type checking. But last time this was attempted, a large amount of code was then complaining due to the fact that sometimes the address is provided as a pointer, sometimes as an integer. this deserves to be cleaned, but this should really be orthogonal to this issue and the move to a static inline be done first, and separately from the actual access implementation.
Good point. A lot of architectures have strict type checking here already, so I don't think this will be much of a problem.
Looks like GCC is doing the right thing here. I can't say for sure, but I have a vague recollection of GCC doing it wrong in some distant past. At the time the only thing that seemed to be OK in all cases with an inline asm implementation was:
static inline void writeh(unsigned int v, volatile unsigned short *a) { asm volatile ("strh\t%1, [%0]" : : "r" (a), "r" (v)); }
This would better use the '"=Q" (a)' form, to tell the compiler that we are also writing into a. We cannot use the "=m" form because that contains the pre/post-increment variants though.
which produced this:
foobar_init: mov r1, #0 strh r1, [r0] mov ip, #0 add r2, r0, #8 strh ip, [r2] add r3, r0, #2048 strh ip, [r3] mvn r2, #0 add r1, r0, #4 strh r2, [r1] mvn r3, #0 add r0, r0, #12 strh r3, [r0] bx lr
As you can see, this is far less optimal and larger. And then GCC even gets confused wrt liveness of the value to store despite using -O2. Hence the implementation using pure C which produced far better code.
Yes, it would be good to avoid this.
If I understood Uli correctly, it should be possible to express this with gcc syntax, but of course there may be a catch somewhere.
Yes. You don't want to use a memory clobber. That tells GCC to forget everything about what it has cached into registers and forces it to reload everything from memory, including the iobase pointer if it comes from a struct device's private data.
Well, in case of readl/writel, we need the memory clobber for other reasons: They might trigger a DMA operation, or the readl might be used to wait for completion of a DMA operation. In either case, you need to have a barrier to prevent the reordering of the MMIO instruction w.r.t the DMA memory that it is referring to. On in-order CPUs, this does not always require a synchronisation instruction, but it strill requires a compiler barrier, which is typically expressed as a full memory clobber.
As far as I can tell, we don't need the memory clobber in case of readl_relaxed()/writel_relaxed(), but we need it for the non-relaxed versions.
As you can see, the iobase value is reloaded all the time needlessly. Another recollection of mine is that a while ago GCC was considering static inline functions like full memory barriers too, which doesn't seem to be the case anymore.
But for the constraints the following should be best:
static inline void writel(unsigned int v, volatile unsigned int *a) { asm volatile ("str\t%1, %0" : : "+o" (*a) : "r" (v)); }
static inline unsigned int readl(volatile unsigned int *a) { unsigned int v; asm volatile ("ldr\t%1, %0" : : "+o" (*a), "=r" (v)); return v; }
The bidirectional qualifier indicates that we're going to modify the memory pointed by a, and yet it is going to be different after that, which pretty much matches how hardware registers behave.
I don't see how the "+" is needed here, because the compiler won't be accessing the pointers otherwise, and doesn't have the option of omitting any accesses either.
Arnd