On Fri, 17 Jun 2011, Arnd Bergmann wrote:
On Friday 17 June 2011, Michael Hope wrote:
Hi Paolo. I've had a look into this and updated the ticket. The problem is due to a change in behaviour between 4.5 and 4.6 when accessing strcuts marked with __attribute__((packed)). 4.5 would do a word-wise access where 4.6 assumes the worst and does a byte-by-byte access.
There was a discussion on the GCC list: http://gcc.gnu.org/ml/gcc/2011-02/msg00035.html
which petered out unfortunately. IMO the compiler is working correctly and the problem should be fixed in the kernel initially and then a long term plan worked out between the kernel and compiler.
Arnd, do you want to pick up that thread? Both Ubuntu and Debian are now 4.6 based so this problem will come up more often. I think Patrik said there's a fix for this specific problem pending.
I fear Russell will not like this too much, but I guess it has to be done. It will also come back to the toolchain team as a request to make the inline assembly syntax more helpful. When we last talked about it on the mailing list, the reason that was given for having writel() not do an inline assembly to access I/O memory was that we cannot express all possible addressing modes well.
I played with this a bit. I think that GCC might have improved on this front. It at least doesn't seem to produce the much suboptimal code compared to the plain C implementation as it used to do in the past. See my explanation and experiments below.
The function in question is
#define __raw_writel(v,a) (*(volatile unsigned int __force *)(a) = (v))
AFAICT, this should really be written as
static inline void __raw_writel(unsigned int v, volatile __iomem unsigned int *a) { asm volatile("str %0, %1" : "=o" (*a) : "r" (v) : "memory"); }
What I don't know is whether this form is ideal in all cases (or even correct), and what the constraints should be.
The problem here is that the assembly constraint doesn't let us specify which type of access this constraint is for.
Let's take the str instruction as an example. It may use many different addressing modes:
str r0, [r1] # destination address in r1 str r0, [r1, r2] # base address in r1 with offset in r2 str r0, [r1, #<i>] # base address in r1 with an immediate offset
The third form is the most common in driver code. You have the iobase in a variable, used with register offsets. So, for example, if you have this code:
#define REG_A 0x00 #define REG_B 0x04 #define REG_C 0x08 #define REG_D 0x0c #define REG_E 0x800
#define writel(v,a) (*(volatile unsigned int *)(a) = (v))
void foobar_init(void * iobase) { /* initialize A, C, and E regs to zero */ writel(0, iobase + REG_A); writel(0, iobase + REG_C); writel(0, iobase + REG_E);
/* initialize B and D to -1 */ writel(-1, iobase + REG_B); writel(-1, iobase + REG_D); }
This compiles to:
foobar_init: mov r3, #0 mvn r2, #0 str r3, [r0, #0] str r3, [r0, #8] str r3, [r0, #2048] str r2, [r0, #4] str r2, [r0, #12] bx lr
Now the writel could have been written as a static inline:
static inline void writel(unsigned int v, volatile unsigned int *a) { *a = v; }
The result is the same as gcc is smart enough to share the common iobase and spread the immediate offsets. This also provides better type checking. But last time this was attempted, a large amount of code was then complaining due to the fact that sometimes the address is provided as a pointer, sometimes as an integer. this deserves to be cleaned, but this should really be orthogonal to this issue and the move to a static inline be done first, and separately from the actual access implementation.
Now let's try your suggestion, with a minor ordering fix:
static inline void writel(unsigned int v, volatile unsigned int *a) { asm volatile ("str\t%1, %0" : : "=o" (*a) : "r" (v)); }
The result is then:
foobar_init: mov r1, #0 str r1, [r0, #0] str r1, [r0, #8] str r1, [r0, #2048] mvn r3, #0 str r3, [r0, #4] str r3, [r0, #12] bx lr
Not obvious consequences here, but the scheduling is different. I suspect this would be more important in the readl() case as every ldr instructions have result delay slots to fill with other instructions when possible. I don't know if the compiler is able to infer this from the constraint.
Arguably, this might not be significant given that IO accesses are typically much more costly with much longer stalls than a delay slot might be.
But more importantly, the immediate offset has a limited range of course. But the problem is about the fact that not every store instructions have the same immediate offset range in their addressing mode. For example, the strh instruction may accommodate only a 8-bit offset compared to 12-bit offset for a word.
Oh wait! Looks like the o constraint is smart enough to adapt to the type of the pointer. So if the pointer for a is a volatile unsigned short then we get this:
foobar_init: mov r1, #0 str r1, [r0, #0] str r1, [r0, #8] add r2, r0, #2048 str r1, [r2, #0] mvn r3, #0 str r3, [r0, #4] str r3, [r0, #12] bx lr
Obviously the inline asm should have used a strh instruction as well, but what I actually wanted to verify here is the restriction on the indexed addressing range. Looks like GCC is doing the right thing here. I can't say for sure, but I have a vague recollection of GCC doing it wrong in some distant past. At the time the only thing that seemed to be OK in all cases with an inline asm implementation was:
static inline void writeh(unsigned int v, volatile unsigned short *a) { asm volatile ("strh\t%1, [%0]" : : "r" (a), "r" (v)); }
which produced this:
foobar_init: mov r1, #0 strh r1, [r0] mov ip, #0 add r2, r0, #8 strh ip, [r2] add r3, r0, #2048 strh ip, [r3] mvn r2, #0 add r1, r0, #4 strh r2, [r1] mvn r3, #0 add r0, r0, #12 strh r3, [r0] bx lr
As you can see, this is far less optimal and larger. And then GCC even gets confused wrt liveness of the value to store despite using -O2. Hence the implementation using pure C which produced far better code.
If I understood Uli correctly, it should be possible to express this with gcc syntax, but of course there may be a catch somewhere.
Yes. You don't want to use a memory clobber. That tells GCC to forget everything about what it has cached into registers and forces it to reload everything from memory, including the iobase pointer if it comes from a struct device's private data.
Consider, for example, this code:
struct device { void *iobase; };
void foobar_init(struct device *dev) { /* initialize A, C, and E regs to zero */ writel(0, dev->iobase + REG_A); writel(0, dev->iobase + REG_C); writel(0, dev->iobase + REG_E);
/* initialize B and D to -1 */ writel(-1, dev->iobase + REG_B); writel(-1, dev->iobase + REG_D); }
If you keep the "memory" clobber, GCC will produce this:
foobar_init: ldr r1, [r0, #0] mov r3, #0 str r3, [r1, #0] ldr r2, [r0, #0] str r3, [r2, #8] ldr ip, [r0, #0] str r3, [ip, #2048] mvn r2, #0 ldr r1, [r0, #0] str r2, [r1, #4] ldr r3, [r0, #0] str r2, [r3, #12] bx lr
As you can see, the iobase value is reloaded all the time needlessly. Another recollection of mine is that a while ago GCC was considering static inline functions like full memory barriers too, which doesn't seem to be the case anymore.
But for the constraints the following should be best:
static inline void writel(unsigned int v, volatile unsigned int *a) { asm volatile ("str\t%1, %0" : : "+o" (*a) : "r" (v)); }
static inline unsigned int readl(volatile unsigned int *a) { unsigned int v; asm volatile ("ldr\t%1, %0" : : "+o" (*a), "=r" (v)); return v; }
The bidirectional qualifier indicates that we're going to modify the memory pointed by a, and yet it is going to be different after that, which pretty much matches how hardware registers behave.
Nicolas