It all started with this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/791552
Basically, switching the toolchain from 4.5 to 4.6 somehow broke USB on OMAP3.
This morning I tested with linux-linaro-natty:
[flag@newluxor linux-linaro-natty]$ git log --oneline -1 f15fd8f LINARO: Linux-linaro-2.6.38-1002.3
and I can reproduce the problem there too.
Here is some more info (dmesg, toolchain, lsusb, etc.):
http://pastebin.ubuntu.com/620786/
On 06/07/2011 01:09 PM, Paolo Pisati wrote:
It all started with this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/791552
Basically, switching the toolchain from 4.5 to 4.6 somehow broke USB on OMAP3.
This morning I tested with linux-linaro-natty:
[flag@newluxor linux-linaro-natty]$ git log --oneline -1 f15fd8f LINARO: Linux-linaro-2.6.38-1002.3
and I can reproduce the problem there too.
Here is some more info (dmesg, toolchain, lsusb, etc.):
http://people.canonical.com/~ppisati/lp791552/
There you'll find the full dmesg and a precompiled kernel to test.
On Tue, Jun 7, 2011 at 11:09 PM, Paolo Pisati paolo.pisati@canonical.com wrote:
It all started with this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/791552
Basically, switching the toolchain from 4.5 to 4.6 somehow broke USB on OMAP3.
This morning I tested with linux-linaro-natty:
[flag@newluxor linux-linaro-natty]$ git log --oneline -1 f15fd8f LINARO: Linux-linaro-2.6.38-1002.3
and I can reproduce the problem there too.
Here is some more info (dmesg, toolchain, lsusb, etc.):
Hi Paolo. Could you please treat this as an application problem and try to track it down further? It would be great if you could track it down to an individual function or chunk of mis-compiled code and supply the preprocessed source that goes with it; then we can jump right onto it.
Does the same fault occur with an FSF 4.6.0 toolchain?
-- Michael
On 06/08/2011 12:29 AM, Michael Hope wrote:
Hi Paolo. Could you please treat this as an application problem and try to track it down further? It would be great if you could track it down to an individual function or chunk of mis-compiled code and supply the preprocessed source that goes with it; then we can jump right onto it.
Does the same fault occur with an FSF 4.6.0 toolchain?
I didn't know where to find a vanilla FSF 4.6 toolchain for ARM, so I rolled my own 4.6 toolchain using the Debian 4.6 source packages inside a sid chroot, and I can confirm the problem is there too.
Meanwhile, while googling around, I found I was not the only one experiencing this issue:
https://lists.yoctoproject.org/pipermail/poky/2011-April/005703.html
and
https://lists.yoctoproject.org/pipermail/poky/2011-May/005763.html
The thread then continued in June with some disassembly/low-level debugging and a couple of workarounds (neither of which worked for me, anyway):
https://lists.yoctoproject.org/pipermail/poky/2011-June/006634.html
Last but not least, I can confirm this issue is present in Linaro, Ubuntu and vanilla kernels going back to at least .35.x (actually the only .35 I tested was Ubuntu Maverick), and it affects OMAP4 too.
I couldn't test 3.0 since it kernel panics at boot (but that should be a separate issue).
On Thu, Jun 16, 2011 at 3:04 AM, Paolo Pisati p.pisati@gmail.com wrote:
On 06/08/2011 12:29 AM, Michael Hope wrote:
Hi Paolo. Could you please treat this as an application problem and try to track it down further? It would be great if you could track it down to an individual function or chunk of mis-compiled code and supply the preprocessed source that goes with it; then we can jump right onto it.
Does the same fault occur with an FSF 4.6.0 toolchain?
I didn't know where to find a vanilla FSF 4.6 toolchain for ARM, so I rolled my own 4.6 toolchain using the Debian 4.6 source packages inside a sid chroot, and I can confirm the problem is there too.
Meanwhile, while googling around, I found I was not the only one experiencing this issue:
https://lists.yoctoproject.org/pipermail/poky/2011-April/005703.html
and
https://lists.yoctoproject.org/pipermail/poky/2011-May/005763.html
The thread then continued in June with some disassembly/low-level debugging and a couple of workarounds (neither of which worked for me, anyway):
https://lists.yoctoproject.org/pipermail/poky/2011-June/006634.html
Last but not least, I can confirm this issue is present in Linaro, Ubuntu and vanilla kernels going back to at least .35.x (actually the only .35 I tested was Ubuntu Maverick), and it affects OMAP4 too.
I couldn't test 3.0 since it kernel panics at boot (but that should be a separate issue).
Hi Paolo. I've had a look into this and updated the ticket. The problem is due to a change in behaviour between 4.5 and 4.6 when accessing structs marked with __attribute__((packed)). 4.5 would do a word-wise access, whereas 4.6 assumes the worst and does a byte-by-byte access.
There was a discussion on the GCC list: http://gcc.gnu.org/ml/gcc/2011-02/msg00035.html
which unfortunately petered out. IMO the compiler is working correctly and the problem should be fixed in the kernel initially, and then a long-term plan worked out between the kernel and compiler.
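To illustrate with a minimal example (not taken from the kernel source):

/* A packed struct: GCC 4.6 no longer assumes any alignment for it. */
struct hw_regs {
        unsigned int ctrl;      /* offset 0, but the struct itself may be
                                   unaligned as far as GCC is concerned */
} __attribute__((packed));

void set_ctrl(struct hw_regs *r, unsigned int v)
{
        r->ctrl = v;    /* 4.5: one str; 4.6: four strb's on ARM */
}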
Arnd, do you want to pick up that thread? Both Ubuntu and Debian are now 4.6 based so this problem will come up more often. I think Patrik said there's a fix for this specific problem pending.
-- Michael
On Friday 17 June 2011, Michael Hope wrote:
Hi Paolo. I've had a look into this and updated the ticket. The problem is due to a change in behaviour between 4.5 and 4.6 when accessing structs marked with __attribute__((packed)). 4.5 would do a word-wise access, whereas 4.6 assumes the worst and does a byte-by-byte access.
There was a discussion on the GCC list: http://gcc.gnu.org/ml/gcc/2011-02/msg00035.html
which unfortunately petered out. IMO the compiler is working correctly and the problem should be fixed in the kernel initially, and then a long-term plan worked out between the kernel and compiler.
Arnd, do you want to pick up that thread? Both Ubuntu and Debian are now 4.6 based so this problem will come up more often. I think Patrik said there's a fix for this specific problem pending.
I fear Russell will not like this too much, but I guess it has to be done. It will also come back to the toolchain team as a request to make the inline assembly syntax more helpful. When we last talked about it on the mailing list, the reason given for having writel() not use inline assembly to access I/O memory was that we cannot express all possible addressing modes well.
The function in question is
#define __raw_writel(v,a) (*(volatile unsigned int __force *)(a) = (v))
AFAICT, this should really be written as
static inline void __raw_writel(unsigned int v, volatile __iomem unsigned int *a)
{
        asm volatile("str %0, %1" : "=o" (*a) : "r" (v) : "memory");
}
What I don't know is whether this form is ideal in all cases (or even correct), and what the constraints should be. If I understood Uli correctly, it should be possible to express this with gcc syntax, but of course there may be a catch somewhere.
We need all six forms of this function, for read/write and 8/16/32 bit accesses.
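For concreteness, a sketch of that full set, using the operand ordering that gets corrected later in this thread (the constraint choice is exactly what is debated below, so take this as illustrative only):

static inline void __raw_writeb(unsigned char v, volatile unsigned char *a)
{
        asm volatile("strb\t%1, %0" : "=o" (*a) : "r" (v));
}

static inline void __raw_writew(unsigned short v, volatile unsigned short *a)
{
        asm volatile("strh\t%1, %0" : "=o" (*a) : "r" (v));
}

static inline void __raw_writel(unsigned int v, volatile unsigned int *a)
{
        asm volatile("str\t%1, %0" : "=o" (*a) : "r" (v));
}

static inline unsigned char __raw_readb(volatile unsigned char *a)
{
        unsigned char v;
        asm volatile("ldrb\t%0, %1" : "=r" (v) : "o" (*a));
        return v;
}

static inline unsigned short __raw_readw(volatile unsigned short *a)
{
        unsigned short v;
        asm volatile("ldrh\t%0, %1" : "=r" (v) : "o" (*a));
        return v;
}

static inline unsigned int __raw_readl(volatile unsigned int *a)
{
        unsigned int v;
        asm volatile("ldr\t%0, %1" : "=r" (v) : "o" (*a));
        return v;
}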
Arnd
On Fri, 17 Jun 2011, Arnd Bergmann wrote:
On Friday 17 June 2011, Michael Hope wrote:
Hi Paolo. I've had a look into this and updated the ticket. The problem is due to a change in behaviour between 4.5 and 4.6 when accessing structs marked with __attribute__((packed)). 4.5 would do a word-wise access, whereas 4.6 assumes the worst and does a byte-by-byte access.
There was a discussion on the GCC list: http://gcc.gnu.org/ml/gcc/2011-02/msg00035.html
which unfortunately petered out. IMO the compiler is working correctly and the problem should be fixed in the kernel initially, and then a long-term plan worked out between the kernel and compiler.
Arnd, do you want to pick up that thread? Both Ubuntu and Debian are now 4.6 based so this problem will come up more often. I think Patrik said there's a fix for this specific problem pending.
I fear Russell will not like this too much, but I guess it has to be done. It will also come back to the toolchain team as a request to make the inline assembly syntax more helpful. When we last talked about it on the mailing list, the reason given for having writel() not use inline assembly to access I/O memory was that we cannot express all possible addressing modes well.
I played with this a bit. I think that GCC might have improved on this front. At least it no longer seems to produce code that is much more suboptimal than the plain C implementation, as it used to in the past. See my explanation and experiments below.
The function in question is
#define __raw_writel(v,a) (*(volatile unsigned int __force *)(a) = (v))
AFAICT, this should really be written as
static inline void __raw_writel(unsigned int v, volatile __iomem unsigned int *a)
{
        asm volatile("str %0, %1" : "=o" (*a) : "r" (v) : "memory");
}
What I don't know is whether this form is ideal in all cases (or even correct), and what the constraints should be.
The problem here is that the assembly constraint doesn't let us specify which type of access this constraint is for.
Let's take the str instruction as an example. It may use many different addressing modes:
str r0, [r1]        # destination address in r1
str r0, [r1, r2]    # base address in r1 with offset in r2
str r0, [r1, #<i>]  # base address in r1 with an immediate offset
The third form is the most common in driver code. You have the iobase in a variable, used with register offsets. So, for example, if you have this code:
#define REG_A 0x00
#define REG_B 0x04
#define REG_C 0x08
#define REG_D 0x0c
#define REG_E 0x800
#define writel(v,a) (*(volatile unsigned int *)(a) = (v))
void foobar_init(void *iobase)
{
        /* initialize A, C, and E regs to zero */
        writel(0, iobase + REG_A);
        writel(0, iobase + REG_C);
        writel(0, iobase + REG_E);

        /* initialize B and D to -1 */
        writel(-1, iobase + REG_B);
        writel(-1, iobase + REG_D);
}
This compiles to:
foobar_init:
        mov     r3, #0
        mvn     r2, #0
        str     r3, [r0, #0]
        str     r3, [r0, #8]
        str     r3, [r0, #2048]
        str     r2, [r0, #4]
        str     r2, [r0, #12]
        bx      lr
Now the writel could have been written as a static inline:
static inline void writel(unsigned int v, volatile unsigned int *a)
{
        *a = v;
}
The result is the same, as GCC is smart enough to share the common iobase and spread the immediate offsets. This also provides better type checking. But the last time this was attempted, a large amount of code started complaining, because sometimes the address is provided as a pointer and sometimes as an integer. This deserves to be cleaned up, but it is really orthogonal to this issue; the move to a static inline should be done first, separately from the actual access implementation.
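For example, hypothetical caller code showing the mismatch (the 0x48000000 base is made up):

void legacy_init(void *iobase)
{
        /* With writel() as a static inline taking a pointer, the first
           call now triggers an int-to-pointer warning: */
        writel(0, 0x48000000 + REG_A);  /* address as a plain integer */
        writel(0, iobase + REG_A);      /* address as a pointer */
}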
Now let's try your suggestion, with a minor ordering fix:
static inline void writel(unsigned int v, volatile unsigned int *a)
{
        asm volatile ("str\t%1, %0" : "=o" (*a) : "r" (v));
}
The result is then:
foobar_init:
        mov     r1, #0
        str     r1, [r0, #0]
        str     r1, [r0, #8]
        str     r1, [r0, #2048]
        mvn     r3, #0
        str     r3, [r0, #4]
        str     r3, [r0, #12]
        bx      lr
No obvious consequences here, but the scheduling is different. I suspect this would matter more in the readl() case, as every ldr instruction has a result delay slot to fill with other instructions when possible. I don't know if the compiler is able to infer this from the constraint.
Arguably, this might not be significant given that IO accesses are typically much more costly with much longer stalls than a delay slot might be.
More importantly, the immediate offset of course has a limited range, and not every store instruction has the same immediate offset range in its addressing mode. For example, the strh instruction can accommodate only an 8-bit offset, compared to a 12-bit offset for a word store.
Oh wait! Looks like the o constraint is smart enough to adapt to the type of the pointer. So if a is a pointer to volatile unsigned short, then we get this:
foobar_init:
        mov     r1, #0
        str     r1, [r0, #0]
        str     r1, [r0, #8]
        add     r2, r0, #2048
        str     r1, [r2, #0]
        mvn     r3, #0
        str     r3, [r0, #4]
        str     r3, [r0, #12]
        bx      lr
Obviously the inline asm should have used a strh instruction as well, but what I actually wanted to verify here is the restriction on the indexed addressing range. Looks like GCC is doing the right thing here. I can't say for sure, but I have a vague recollection of GCC doing it wrong in some distant past. At the time the only thing that seemed to be OK in all cases with an inline asm implementation was:
static inline void writeh(unsigned int v, volatile unsigned short *a)
{
        asm volatile ("strh\t%1, [%0]" : : "r" (a), "r" (v));
}
which produced this:
foobar_init:
        mov     r1, #0
        strh    r1, [r0]
        mov     ip, #0
        add     r2, r0, #8
        strh    ip, [r2]
        add     r3, r0, #2048
        strh    ip, [r3]
        mvn     r2, #0
        add     r1, r0, #4
        strh    r2, [r1]
        mvn     r3, #0
        add     r0, r0, #12
        strh    r3, [r0]
        bx      lr
As you can see, this is far less optimal, and larger. And then GCC even gets confused about the liveness of the value to store, despite using -O2. Hence the pure C implementation, which produced far better code.
If I understood Uli correctly, it should be possible to express this with gcc syntax, but of course there may be a catch somewhere.
Yes. You don't want to use a memory clobber. That tells GCC to forget everything about what it has cached into registers and forces it to reload everything from memory, including the iobase pointer if it comes from a struct device's private data.
Consider, for example, this code:
struct device { void *iobase; };
void foobar_init(struct device *dev)
{
        /* initialize A, C, and E regs to zero */
        writel(0, dev->iobase + REG_A);
        writel(0, dev->iobase + REG_C);
        writel(0, dev->iobase + REG_E);

        /* initialize B and D to -1 */
        writel(-1, dev->iobase + REG_B);
        writel(-1, dev->iobase + REG_D);
}
If you keep the "memory" clobber, GCC will produce this:
foobar_init:
        ldr     r1, [r0, #0]
        mov     r3, #0
        str     r3, [r1, #0]
        ldr     r2, [r0, #0]
        str     r3, [r2, #8]
        ldr     ip, [r0, #0]
        str     r3, [ip, #2048]
        mvn     r2, #0
        ldr     r1, [r0, #0]
        str     r2, [r1, #4]
        ldr     r3, [r0, #0]
        str     r2, [r3, #12]
        bx      lr
As you can see, the iobase value is needlessly reloaded all the time. Another recollection of mine is that a while ago GCC treated static inline functions as full memory barriers too, which doesn't seem to be the case anymore.
But for the constraints the following should be best:
static inline void writel(unsigned int v, volatile unsigned int *a)
{
        asm volatile ("str\t%1, %0" : "+o" (*a) : "r" (v));
}

static inline unsigned int readl(volatile unsigned int *a)
{
        unsigned int v;
        asm volatile ("ldr\t%1, %0" : "+o" (*a), "=r" (v));
        return v;
}
The bidirectional qualifier indicates that we're going to modify the memory pointed to by a, and that it is going to be different afterwards, which pretty much matches how hardware registers behave.
Nicolas
On Saturday 18 June 2011, Nicolas Pitre wrote:
On Fri, 17 Jun 2011, Arnd Bergmann wrote:
I played with this a bit.
It looks like much more than a bit, thanks for the detailed analysis!
I think that GCC might have improved on this front. At least it no longer seems to produce code that is much more suboptimal than the plain C implementation, as it used to in the past. See my explanation and experiments below.
If everything else fails, we can make it depend on the GCC version, e.g. use the inline assembly with 4.5 and higher, while we keep the C code that used to be fine for older versions.
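A hypothetical sketch of such a version switch, using the macros GCC predefines about itself:

/* Hypothetical: inline assembly on GCC >= 4.5, plain C before that. */
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)
static inline void __raw_writel(unsigned int v, volatile unsigned int *a)
{
        asm volatile("str\t%1, %0" : "=o" (*a) : "r" (v));
}
#else
#define __raw_writel(v,a)       (*(volatile unsigned int *)(a) = (v))
#endif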
The function in question is
#define __raw_writel(v,a) (*(volatile unsigned int __force *)(a) = (v))
AFAICT, this should really be written as
static inline void __raw_writel(unsigned int v, volatile __iomem unsigned int *a)
{
        asm volatile("str %0, %1" : "=o" (*a) : "r" (v) : "memory");
}
What I don't know is whether this form is ideal in all cases (or even correct), and what the constraints should be.
The problem here is that the assembly constraint doesn't let us specify which type of access this constraint is for.
Let's take the str instruction as an example. It may use many different addressing modes:
str r0, [r1]        # destination address in r1
str r0, [r1, r2]    # base address in r1 with offset in r2
str r0, [r1, #<i>]  # base address in r1 with an immediate offset
I believe that they are all part of the "o" constraint, which includes all offsettable memory addresses.
The result is the same, as GCC is smart enough to share the common iobase and spread the immediate offsets. This also provides better type checking. But the last time this was attempted, a large amount of code started complaining, because sometimes the address is provided as a pointer and sometimes as an integer. This deserves to be cleaned up, but it is really orthogonal to this issue; the move to a static inline should be done first, separately from the actual access implementation.
Good point. A lot of architectures have strict type checking here already, so I don't think this will be much of a problem.
Looks like GCC is doing the right thing here. I can't say for sure, but I have a vague recollection of GCC doing it wrong in some distant past. At the time the only thing that seemed to be OK in all cases with an inline asm implementation was:
static inline void writeh(unsigned int v, volatile unsigned short *a)
{
        asm volatile ("strh\t%1, [%0]" : : "r" (a), "r" (v));
}
This would be better using the '"=Q" (*a)' form, to tell the compiler that we are also writing into *a. We cannot use the "=m" form though, because that includes the pre/post-increment variants.
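A sketch of how that might look (assuming ARM's "Q" constraint, which describes a memory operand addressed by a single base register with no offset; treat this as illustrative rather than tested):

static inline void writeh(unsigned short v, volatile unsigned short *a)
{
        /* "Q" restricts the operand to plain [reg] addressing, which is
           always within strh's limited offset range; "=Q" (*a) tells GCC
           the asm writes *a, without needing a full memory clobber. */
        asm volatile("strh\t%1, %0" : "=Q" (*a) : "r" (v));
}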
which produced this:
foobar_init:
        mov     r1, #0
        strh    r1, [r0]
        mov     ip, #0
        add     r2, r0, #8
        strh    ip, [r2]
        add     r3, r0, #2048
        strh    ip, [r3]
        mvn     r2, #0
        add     r1, r0, #4
        strh    r2, [r1]
        mvn     r3, #0
        add     r0, r0, #12
        strh    r3, [r0]
        bx      lr
As you can see, this is far less optimal, and larger. And then GCC even gets confused about the liveness of the value to store, despite using -O2. Hence the pure C implementation, which produced far better code.
Yes, it would be good to avoid this.
If I understood Uli correctly, it should be possible to express this with gcc syntax, but of course there may be a catch somewhere.
Yes. You don't want to use a memory clobber. That tells GCC to forget everything about what it has cached into registers and forces it to reload everything from memory, including the iobase pointer if it comes from a struct device's private data.
Well, in the case of readl/writel, we need the memory clobber for other reasons: they might trigger a DMA operation, or the readl might be used to wait for completion of a DMA operation. In either case, you need a barrier to prevent the reordering of the MMIO instruction w.r.t. the DMA memory it refers to. On in-order CPUs this does not always require a synchronisation instruction, but it still requires a compiler barrier, which is typically expressed as a full memory clobber.
As far as I can tell, we don't need the memory clobber in case of readl_relaxed()/writel_relaxed(), but we need it for the non-relaxed versions.
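In other words, something along these lines (a sketch; the real kernel accessors also deal with endianness and __iomem annotations):

/* Relaxed: a single access, no compiler barrier, no DMA ordering. */
static inline void writel_relaxed(unsigned int v, volatile unsigned int *a)
{
        asm volatile("str\t%1, %0" : "=o" (*a) : "r" (v));
}

/* Non-relaxed: the "memory" clobber provides the compiler barrier that
   orders the MMIO store against accesses to DMA buffers. */
static inline void writel(unsigned int v, volatile unsigned int *a)
{
        asm volatile("str\t%1, %0" : "=o" (*a) : "r" (v) : "memory");
}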
As you can see, the iobase value is needlessly reloaded all the time. Another recollection of mine is that a while ago GCC treated static inline functions as full memory barriers too, which doesn't seem to be the case anymore.
But for the constraints the following should be best:
static inline void writel(unsigned int v, volatile unsigned int *a)
{
        asm volatile ("str\t%1, %0" : "+o" (*a) : "r" (v));
}

static inline unsigned int readl(volatile unsigned int *a)
{
        unsigned int v;
        asm volatile ("ldr\t%1, %0" : "+o" (*a), "=r" (v));
        return v;
}
The bidirectional qualifier indicates that we're going to modify the memory pointed to by a, and that it is going to be different afterwards, which pretty much matches how hardware registers behave.
I don't see how the "+" is needed here, because the compiler won't be accessing the pointers otherwise, and doesn't have the option of omitting any accesses either.
Arnd
On Mon, 20 Jun 2011, Arnd Bergmann wrote:
On Saturday 18 June 2011, Nicolas Pitre wrote:
On Fri, 17 Jun 2011, Arnd Bergmann wrote:
If I understood Uli correctly, it should be possible to express this with gcc syntax, but of course there may be a catch somewhere.
Yes. You don't want to use a memory clobber. That tells GCC to forget everything about what it has cached into registers and forces it to reload everything from memory, including the iobase pointer if it comes from a struct device's private data.
Well, in the case of readl/writel, we need the memory clobber for other reasons: they might trigger a DMA operation, or the readl might be used to wait for completion of a DMA operation. In either case, you need a barrier to prevent the reordering of the MMIO instruction w.r.t. the DMA memory it refers to. On in-order CPUs this does not always require a synchronisation instruction, but it still requires a compiler barrier, which is typically expressed as a full memory clobber.
We don't have any of that now though.
And I think this would be much more sensible if the barrier were explicit, so as not to generate awful code, as I demonstrated, in the vast majority of cases where there are no such issues.
And the fact that the inline asm is marked volatile should be enough to prevent reordering already, isn't it?
But for the constraints the following should be best:
static inline void writel(unsigned int v, volatile unsigned int *a)
{
        asm volatile ("str\t%1, %0" : "+o" (*a) : "r" (v));
}

static inline unsigned int readl(volatile unsigned int *a)
{
        unsigned int v;
        asm volatile ("ldr\t%1, %0" : "+o" (*a), "=r" (v));
        return v;
}
The bidirectional qualifier indicates that we're going to modify the memory pointed to by a, and that it is going to be different afterwards, which pretty much matches how hardware registers behave.
I don't see how the "+" is needed here, because the compiler won't be accessing the pointers otherwise, and doesn't have the option of omitting any accesses either.
It could prevent the compiler from assuming that the memory at that location still contains the same data that was just written/read, even if the volatile were omitted (note I'm not suggesting to remove 'volatile').
Nicolas
On Monday 20 June 2011, Nicolas Pitre wrote:
On Mon, 20 Jun 2011, Arnd Bergmann wrote:
On Saturday 18 June 2011, Nicolas Pitre wrote: Well, in the case of readl/writel, we need the memory clobber for other reasons: they might trigger a DMA operation, or the readl might be used to wait for completion of a DMA operation. In either case, you need a barrier to prevent the reordering of the MMIO instruction w.r.t. the DMA memory it refers to. On in-order CPUs this does not always require a synchronisation instruction, but it still requires a compiler barrier, which is typically expressed as a full memory clobber.
We don't have any of that now though.
We do have __iormb() in there, which is both a CPU and a compiler barrier. Unfortunately, in case of !CONFIG_ARM_DMA_MEM_BUFFERABLE it degrades to no barrier at all, which is a bug.
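Roughly this shape (a sketch of the problem, not the exact kernel source):

#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
#define __iormb()       rmb()                   /* CPU + compiler barrier */
#else
#define __iormb()       do { } while (0)        /* no barrier at all: the bug */
#endif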
And I think this would be much more sensible if the barrier were explicit, so as not to generate awful code, as I demonstrated, in the vast majority of cases where there are no such issues.
Yes, we can recommend that people use the _relaxed versions, especially for drivers that don't do any DMA.
And the fact that the inline asm is marked volatile should be enough to prevent reordering already, isn't it?
No. It only makes sure that the asm will get executed, but does not prevent the compiler from reordering other statements around it.
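A small hypothetical illustration:

extern unsigned int dma_buf[16];

void kick_dma(volatile unsigned int *reg)
{
        dma_buf[0] = 1;
        /* "asm volatile" guarantees the store instruction below is
           emitted, but without a "memory" clobber GCC may still sink
           the dma_buf store past it. */
        asm volatile("str\t%1, %0" : "=o" (*reg) : "r" (1));
}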
static inline unsigned int readl(volatile unsigned int *a)
{
        unsigned int v;
        asm volatile ("ldr\t%1, %0" : "+o" (*a), "=r" (v));
        return v;
}
The bidirectional qualifier indicates that we're going to modify the memory pointed to by a, and that it is going to be different afterwards, which pretty much matches how hardware registers behave.
I don't see how the "+" is needed here, because the compiler won't be accessing the pointers otherwise, and doesn't have the option of omitting any accesses either.
It could prevent the compiler from assuming that the memory at that location still contains the same data that was just written/read, even if the volatile were omitted (note I'm not suggesting to remove 'volatile').
But the compiler doesn't know the contents of the pointer anyway, so even if it assumes that it hasn't changed, that won't mean anything at all.
Arnd
On Tue, Jun 21, 2011 at 4:11 AM, Arnd Bergmann arnd.bergmann@linaro.org wrote:
On Monday 20 June 2011, Nicolas Pitre wrote:
On Mon, 20 Jun 2011, Arnd Bergmann wrote:
On Saturday 18 June 2011, Nicolas Pitre wrote: Well, in the case of readl/writel, we need the memory clobber for other reasons: they might trigger a DMA operation, or the readl might be used to wait for completion of a DMA operation. In either case, you need a barrier to prevent the reordering of the MMIO instruction w.r.t. the DMA memory it refers to. On in-order CPUs this does not always require a synchronisation instruction, but it still requires a compiler barrier, which is typically expressed as a full memory clobber.
We don't have any of that now though.
We do have __iormb() in there, which is both a CPU and a compiler barrier. Unfortunately, in case of !CONFIG_ARM_DMA_MEM_BUFFERABLE it degrades to no barrier at all, which is a bug.
And I think this would be much more sensible if the barrier were explicit, so as not to generate awful code, as I demonstrated, in the vast majority of cases where there are no such issues.
Yes, we can recommend that people use the _relaxed versions, especially for drivers that don't do any DMA.
And the fact that the inline asm is marked volatile should be enough to prevent reordering already, isn't it?
No. It only makes sure that the asm will get executed, but does not prevent the compiler from reordering other statements around it.
static inline unsigned int readl(volatile unsigned int *a)
{
        unsigned int v;
        asm volatile ("ldr\t%1, %0" : "+o" (*a), "=r" (v));
        return v;
}
The bidirectional qualifier indicates that we're going to modify the memory pointed to by a, and that it is going to be different afterwards, which pretty much matches how hardware registers behave.
I don't see how the "+" is needed here, because the compiler won't be accessing the pointers otherwise, and doesn't have the option of omitting any accesses either.
It could prevent the compiler from assuming that the memory at that location still contains the same data that was just written/read, even if the volatile were omitted (note I'm not suggesting to remove 'volatile').
But the compiler doesn't know the contents of the pointer anyway, so even if it assumes that it hasn't changed, that won't mean anything at all.
Hi everyone. Sorry for the delay in replying.
To summarise the thread:
* The kernel uses writel() and readl() for accessing device registers
* These must result in a single memory access
* They enforce memory ordering and require the programmer to ensure alignment
* The ARM asm/io.h uses C constructs for read and write
* The behaviour of __attribute__((packed)) changed in GCC 4.6, causing some word accesses to be split into byte-by-byte unaligned accesses
* Most other architectures use inline assembly
* ARM doesn't use inline assembly as the performance was poor in the past
* Inline assembly in recent GCC is correct and performs well
Based on Nico's reply, I had a play with the different ways of implementing register access. I tried base + offset such as:
void *iobase;
#define REG_A 0x10
...and struct-based layout:
struct regs {
        unsigned int reg_a;
        unsigned int reg_b;
};
and the different ways of holding the register pointer:
void init(void *iobase);
struct device { void *iobase; };
struct device { struct regs *regs; };
The relaxed inline assembly version works well in all cases, including handling the limits of the different offset encodings. The strict inline assembly version with a memory barrier is poor where iobase is inside a struct, but that's easy to work around by taking iobase into a register first, such as:
void init(struct device *dev)
{
        struct regs *r = dev->regs;

        writel(0, &r->reg_a);
        writel(42, &r->reg_b);
        ...
I see no reason not to switch writel() and readl() to inline assembly. This gives you control over the otherwise undefined behaviour.
On a separate note, I'm hoping to land the unaligned access patches into next month's Linaro GCC 4.6 release. This would fix the problem on ARMv7 architectures but leave it lurking on earlier ones.
It seems that GCC is currently good enough for the job. The next steps on the kernel side are to audit any __attribute__((packed))'s, check __io{r,w}mb() when CONFIG_ARM_DMA_MEM_BUFFERABLE=n, and to trial changing to inline assembly.
Let me know if we can help.
-- Michael