By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as "__alignof__(unsigned long long)" which looks fine but not for ARC.
ARC tools ABI sets align of "long long" the same as for "long" = 4 instead of 8 one may think of.
Thus slab allocator may easily allocate a buffer which is 32-bit aligned. And most of the time it's OK until we start dealing with 64-bit atomics with special LLOCKD/SCONDD instructions which (as opposed to their 32-bit counterparts LLOCK/SCOND) operate with full 64-bit words but those words must be 64-bit aligned.
Fixes Ext4 folder removal: --------------------->8------------------- [ 4.015732] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null) [ 4.167881] [ 4.167881] Misaligned Access [ 4.172356] Path: /bin/busybox.nosuid [ 4.176004] CPU: 2 PID: 171 Comm: rm Not tainted 4.19.14-yocto-standard #1 [ 4.182851] [ 4.182851] [ECR ]: 0x000d0000 => Check Programmer's Manual [ 4.190061] [EFA ]: 0xbeaec3fc [ 4.190061] [BLINK ]: ext4_delete_entry+0x210/0x234 [ 4.190061] [ERET ]: ext4_delete_entry+0x13e/0x234 [ 4.202985] [STAT32]: 0x80080002 : IE K [ 4.207236] BTA: 0x9009329c SP: 0xbe5b1ec4 FP: 0x00000000 [ 4.212790] LPS: 0x9074b118 LPE: 0x9074b120 LPC: 0x00000000 [ 4.218348] r00: 0x00000040 r01: 0x00000021 r02: 0x00000001 [ 4.218348] r03: 0x00000000 r04: 0x00000002 r05: 0x00000000 [ 4.218348] r06: 0x000000c6 r07: 0x00000000 r08: 0x9050f140 [ 4.218348] r09: 0x000000c6 r10: 0x0000000a r11: 0x00000000 [ 4.218348] r12: 0x90247a9c r13: 0x9004e574 r14: 0x0008e150 [ 4.218348] r15: 0x000989b8 r16: 0x0008cbec r17: 0x0009806c [ 4.218348] r18: 0x0009806c r19: 0x0008e150 r20: 0x0008f0f8 [ 4.218348] r21: 0x000000ab r22: 0x0008f0f8 r23: 0x00000000 [ 4.218348] r24: 0x00000000 r25: 0x00000000 [ 4.218348] [ 4.218348] [ 4.270510] [ 4.270510] Stack Trace: [ 4.274510] ext4_delete_entry+0x13e/0x234 [ 4.278695] ext4_rmdir+0xe0/0x238 [ 4.282187] vfs_rmdir+0x50/0xf0 [ 4.285492] do_rmdir+0x9e/0x154 [ 4.288802] EV_Trap+0x110/0x114 --------------------->8-------------------
Signed-off-by: Alexey Brodkin abrodkin@synopsys.com Cc: stable@vger.kernel.org # 4.8+ --- arch/arc/include/asm/cache.h | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/arch/arc/include/asm/cache.h b/arch/arc/include/asm/cache.h index f393b663413e..74f8fcaaef5c 100644 --- a/arch/arc/include/asm/cache.h +++ b/arch/arc/include/asm/cache.h @@ -52,6 +52,16 @@ #define cache_line_size() SMP_CACHE_BYTES #define ARCH_DMA_MINALIGN SMP_CACHE_BYTES
+/* + * Make sure slab-allocated buffers are 64-bit aligned. + * This is required for llockd/scondd to deal with 64-bit aligned dwords. + * By default ARCH_SLAB_MINALIGN is __alignof__(long long) which in + * case of ARC is 4 instead of 8! + */ +#ifdef CONFIG_ARC_HAS_LL64 +#define ARCH_SLAB_MINALIGN 8 +#endif + extern void arc_cache_init(void); extern char *arc_cache_mumbojumbo(int cpu_id, char *buf, int len); extern void read_decode_cache_bcr(void);
On 2/8/19 2:55 AM, Alexey Brodkin wrote:
By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as "__alignof__(unsigned long long)" which looks fine but not for ARC.
Just for the record, the issue happens because a LLOCKD (exclusive 64-bit load) was trying to use a 32-bit aligned effective address (for atomic64_t), not allowed by ISA (LLOCKD can only take 64-bit aligned address, even when the CPU has unaligned access enabled).
This in turn was happening because this word is embedded in some other struct and happens to be 4 byte aligned
ARC tools ABI sets align of "long long" the same as for "long" = 4 instead of 8 one may think of.
Right, this was indeed unexpected and not like most other arches. ARCv2 ISA allows regular 64-bit loads/stores (LDD/STD) to take 32-bit aligned addresses. Thus ABI relaxing the alignment for 64-bit data potentially causes more packing and less space waste. But on the flip side we need to waste space at arbitrary places liek this.
So this is all good and theory, but I'm not 100% sure how slab alignment helps here (and is future proof). So the outer struct with embedded atomic64_t was allocated via slab and your patch ensures that outer struct is 64-bit aligned ?
But how does that guarantee that all embedded atomic64_t in there will be 64-bit aligned (in future say) in the light of ARC ABI and the gcc bug/feature which Peter alluded to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54188 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10360
Thus slab allocator may easily allocate a buffer which is 32-bit aligned. And most of the time it's OK until we start dealing with 64-bit atomics with special LLOCKD/SCONDD instructions which (as opposed to their 32-bit counterparts LLOCK/SCOND) operate with full 64-bit words but those words must be 64-bit aligned.
Some of this text needed to go above to give more context.
Fixes Ext4 folder removal: --------------------->8------------------- [ 4.015732] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null) [ 4.167881]
..
Signed-off-by: Alexey Brodkin abrodkin@synopsys.com Cc: stable@vger.kernel.org # 4.8+
arch/arc/include/asm/cache.h | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/arch/arc/include/asm/cache.h b/arch/arc/include/asm/cache.h index f393b663413e..74f8fcaaef5c 100644 --- a/arch/arc/include/asm/cache.h +++ b/arch/arc/include/asm/cache.h @@ -52,6 +52,16 @@ #define cache_line_size() SMP_CACHE_BYTES #define ARCH_DMA_MINALIGN SMP_CACHE_BYTES +/*
- Make sure slab-allocated buffers are 64-bit aligned.
- This is required for llockd/scondd to deal with 64-bit aligned dwords.
- By default ARCH_SLAB_MINALIGN is __alignof__(long long) which in
- case of ARC is 4 instead of 8!
- */
+#ifdef CONFIG_ARC_HAS_LL64 +#define ARCH_SLAB_MINALIGN 8 +#endif
extern void arc_cache_init(void); extern char *arc_cache_mumbojumbo(int cpu_id, char *buf, int len); extern void read_decode_cache_bcr(void);
From: Vineet Gupta
Sent: 12 February 2019 17:17
On 2/8/19 2:55 AM, Alexey Brodkin wrote:
By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as "__alignof__(unsigned long long)" which looks fine but not for ARC.
Just for the record, the issue happens because a LLOCKD (exclusive 64-bit load) was trying to use a 32-bit aligned effective address (for atomic64_t), not allowed by ISA (LLOCKD can only take 64-bit aligned address, even when the CPU has unaligned access enabled).
This in turn was happening because this word is embedded in some other struct and happens to be 4 byte aligned
ARC tools ABI sets align of "long long" the same as for "long" = 4 instead of 8 one may think of.
Right, but __alignof__() doesn't have to return the alignment that would be used for a data item of the specified type. (Read the gcc 'bug' info for gory details.)
On i386 __alignof__(long long) is 8, but structure members of type 'long long' are 4 byte aligned and the alignment of a structure with a 'long long' member is only 4. (Although the microsoft compiler returns 4.)
Right, this was indeed unexpected and not like most other arches. ARCv2 ISA allows regular 64-bit loads/stores (LDD/STD) to take 32-bit aligned addresses. Thus ABI relaxing the alignment for 64-bit data potentially causes more packing and less space waste. But on the flip side we need to waste space at arbitrary places liek this.
So this is all good and theory, but I'm not 100% sure how slab alignment helps here (and is future proof). So the outer struct with embedded atomic64_t was allocated via slab and your patch ensures that outer struct is 64-bit aligned ?
Presumable 'atomic64_t' has an alignment attribute to force 8 byte alignment.
But how does that guarantee that all embedded atomic64_t in there will be 64-bit aligned (in future say) in the light of ARC ABI and the gcc bug/feature which Peter alluded to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54188 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10360
Thus slab allocator may easily allocate a buffer which is 32-bit aligned. And most of the time it's OK until we start dealing with 64-bit atomics with special LLOCKD/SCONDD instructions which (as opposed to their 32-bit counterparts LLOCK/SCOND) operate with full 64-bit words but those words must be 64-bit aligned.
Some of this text needed to go above to give more context.
I suspect the slab allocator should be returning 8 byte aligned addresses on all systems....
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
+CC some folks interested in alignment stuff in the past.
On 2/12/19 9:30 AM, David Laight wrote:
From: Vineet Gupta
Sent: 12 February 2019 17:17
On 2/8/19 2:55 AM, Alexey Brodkin wrote:
By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as "__alignof__(unsigned long long)" which looks fine but not for ARC.
Just for the record, the issue happens because a LLOCKD (exclusive 64-bit load) was trying to use a 32-bit aligned effective address (for atomic64_t), not allowed by ISA (LLOCKD can only take 64-bit aligned address, even when the CPU has unaligned access enabled).
This in turn was happening because this word is embedded in some other struct and happens to be 4 byte aligned
ARC tools ABI sets align of "long long" the same as for "long" = 4 instead of 8 one may think of.
Right, but __alignof__() doesn't have to return the alignment that would be used for a data item of the specified type. (Read the gcc 'bug' info for gory details.)
On i386 __alignof__(long long) is 8, but structure members of type 'long long' are 4 byte aligned and the alignment of a structure with a 'long long' member is only 4. (Although the microsoft compiler returns 4.)
Exactly my point that this fudging of outer alignment is no magic bullet.
Right, this was indeed unexpected and not like most other arches. ARCv2 ISA allows regular 64-bit loads/stores (LDD/STD) to take 32-bit aligned addresses. Thus ABI relaxing the alignment for 64-bit data potentially causes more packing and less space waste. But on the flip side we need to waste space at arbitrary places liek this.
So this is all good and theory, but I'm not 100% sure how slab alignment helps here (and is future proof). So the outer struct with embedded atomic64_t was allocated via slab and your patch ensures that outer struct is 64-bit aligned ?
Presumable 'atomic64_t' has an alignment attribute to force 8 byte alignment.
It does for ARC
typedef struct { aligned_u64 counter; } atomic64_t;
But what was your point ?
But how does that guarantee that all embedded atomic64_t in there will be 64-bit aligned (in future say) in the light of ARC ABI and the gcc bug/feature which Peter alluded to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54188 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10360
Thus slab allocator may easily allocate a buffer which is 32-bit aligned. And most of the time it's OK until we start dealing with 64-bit atomics with special LLOCKD/SCONDD instructions which (as opposed to their 32-bit counterparts LLOCK/SCOND) operate with full 64-bit words but those words must be 64-bit aligned.
Some of this text needed to go above to give more context.
I suspect the slab allocator should be returning 8 byte aligned addresses on all systems....
why ? As I understand it is still not fool proof against the expected alignment of inner members. There ought to be a better way to enforce all this.
On Tue, Feb 12, 2019 at 09:45:53AM -0800, Vineet Gupta wrote:
+CC some folks interested in alignment stuff in the past.
On 2/12/19 9:30 AM, David Laight wrote:
From: Vineet Gupta
Sent: 12 February 2019 17:17
On 2/8/19 2:55 AM, Alexey Brodkin wrote:
By default ARCH_SLAB_MINALIGN is defined in "include/linux/slab.h" as "__alignof__(unsigned long long)" which looks fine but not for ARC.
Just for the record, the issue happens because a LLOCKD (exclusive 64-bit load) was trying to use a 32-bit aligned effective address (for atomic64_t), not allowed by ISA (LLOCKD can only take 64-bit aligned address, even when the CPU has unaligned access enabled).
This in turn was happening because this word is embedded in some other struct and happens to be 4 byte aligned
ARC tools ABI sets align of "long long" the same as for "long" = 4 instead of 8 one may think of.
Right, but __alignof__() doesn't have to return the alignment that would be used for a data item of the specified type. (Read the gcc 'bug' info for gory details.)
On i386 __alignof__(long long) is 8, but structure members of type 'long long' are 4 byte aligned and the alignment of a structure with a 'long long' member is only 4. (Although the microsoft compiler returns 4.)
Exactly my point that this fudging of outer alignment is no magic bullet.
IMO (and yes I knew about that i386 thing) this is just plain wrong. Of course we'll have to live with that crap, but that doesn't make it less crap.
Right, this was indeed unexpected and not like most other arches. ARCv2 ISA allows regular 64-bit loads/stores (LDD/STD) to take 32-bit aligned addresses. Thus ABI relaxing the alignment for 64-bit data potentially causes more packing and less space waste. But on the flip side we need to waste space at arbitrary places liek this.
So this is all good and theory, but I'm not 100% sure how slab alignment helps here (and is future proof). So the outer struct with embedded atomic64_t was allocated via slab and your patch ensures that outer struct is 64-bit aligned ?
Presumable 'atomic64_t' has an alignment attribute to force 8 byte alignment.
It does for ARC
typedef struct { aligned_u64 counter; } atomic64_t;
Note that atomic*_t is signed; also note that it doesn't matter in practise because -fno-strict-overflow.
Personally I think u64 and company should already force natural alignment; but alas. I though that was part of the reason we have __u64 and co., so that ABI is invariant to kernel alignment changes.
But how does that guarantee that all embedded atomic64_t in there will be 64-bit aligned (in future say) in the light of ARC ABI and the gcc bug/feature which Peter alluded to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54188 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10360
I strongly agree with all those that say __alignof__ is broken and argue for the C11 _Alignof/alignof semantics.
In particular I think that:
T x;
struct foo { T x; };
alignof(x) == alignof(foo::x)
And:
Aggregates (structures and arrays) and unions assume the alignment of their most strictly aligned component.
Otherwise none of this is remotely usable.
Thus slab allocator may easily allocate a buffer which is 32-bit aligned. And most of the time it's OK until we start dealing with 64-bit atomics with special LLOCKD/SCONDD instructions which (as opposed to their 32-bit counterparts LLOCK/SCOND) operate with full 64-bit words but those words must be 64-bit aligned.
Some of this text needed to go above to give more context.
I suspect the slab allocator should be returning 8 byte aligned addresses on all systems....
why ? As I understand it is still not fool proof against the expected alignment of inner members. There ought to be a better way to enforce all this.
I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.
In the past I've proposed a GCC plugin / checker that would verify the alignment requirements against the various allocators.
For instance:
struct foo { spinlock_t a; int b; } __cacheline_aligned;
struct foo *my_foo = kmalloc(sizeof(struct foo), GFP_KERNEL);
would result in a warning; because obviously kmalloc (as per ARCH_SLAB_MINALIGN) doesn't respect the cacheline alignment of the type.
Of course; it appears our kmalloc() function definition doesn't even have a __malloc attribute, so there's plenty work to be done here.
From: Peter Zijlstra
Sent: 13 February 2019 12:57
... ...
In the past I've proposed a GCC plugin / checker that would verify the alignment requirements against the various allocators.
For instance:
struct foo { spinlock_t a; int b; } __cacheline_aligned;
struct foo *my_foo = kmalloc(sizeof(struct foo), GFP_KERNEL);
would result in a warning; because obviously kmalloc (as per ARCH_SLAB_MINALIGN) doesn't respect the cacheline alignment of the type.
Of course; it appears our kmalloc() function definition doesn't even have a __malloc attribute, so there's plenty work to be done here.
We could pass the alignment to the allocator by defining something like:
#define do_malloc(x) ((x) = (typeof(*(x)))_do_malloc(sizeof *(x), __alignof__(*(x))))
Although you probably want to compile-time detect alignments that are smaller than the normal minimal alignment.
If the allocator needs to add a header it would need to use the byte before the allocated item to find the header. OTOH adding a header is horrid for page-sized items.
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On 2/13/19 4:56 AM, Peter Zijlstra wrote:
Personally I think u64 and company should already force natural alignment; but alas.
But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is allowed to take a 32-bit aligned address to load a register pair. Thus all u64 need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation in ABI (alignment of long long is 4). You could certainly argue that we end up undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but still...
I though that was part of the reason we have __u64 and co., so that ABI is invariant to kernel alignment changes.
Apparently not.
I suspect the slab allocator should be returning 8 byte aligned addresses on all systems....
why ? As I understand it is still not fool proof against the expected alignment of inner members. There ought to be a better way to enforce all this.
I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.
This issue aside, are there other reasons ? Because making it 8 on ARC is just pending the eventuality for later.
Hi Vineet, Peter, all,
-----Original Message----- From: Vineet Gupta vgupta@synopsys.com Sent: Thursday, February 14, 2019 2:24 AM To: Peter Zijlstra peterz@infradead.org Cc: David Laight David.Laight@ACULAB.COM; Alexey Brodkin alexey.brodkin@synopsys.com; linux-snps- arc@lists.infradead.org; Arnd Bergmann arnd.bergmann@linaro.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org; Mark Rutland mark.rutland@arm.com Subject: Re: [PATCH] ARC: Explicitly set ARCH_SLAB_MINALIGN = 8
On 2/13/19 4:56 AM, Peter Zijlstra wrote:
Personally I think u64 and company should already force natural alignment; but alas.
But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is allowed to take a 32-bit aligned address to load a register pair. Thus all u64 need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation in ABI (alignment of long long is 4). You could certainly argue that we end up undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but still...
I though that was part of the reason we have __u64 and co., so that ABI is invariant to kernel alignment changes.
Apparently not.
I suspect the slab allocator should be returning 8 byte aligned addresses on all systems....
why ? As I understand it is still not fool proof against the expected alignment of inner members. There ought to be a better way to enforce all this.
I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.
This issue aside, are there other reasons ? Because making it 8 on ARC is just pending the eventuality for later.
But that's pretty much the same for other 32-bit arches that have 64-bit atomics like ARM etc. From what I may see from ARM's documentation for LDREXD/SRREXD they require double-word alignment of data as well.
That said if for some reason atomic64_t variable is unaligned execution on any (or at least most) 32-bit architectures will lead to run-time failure, i.e. we'll know about it and this will be fixed.
And what I'm doing by that change (ARCH_SLAB_MINALIGN=8 for ARC) I'm just working-around peculiarity of ARC ABI.
Out of curiosity I checked if there're any other occurrences of "alingof(long long)" and there seems to be a couple of more: ----------------------------------->8----------------------------- # git grep alignof | grep "long long"
...
kernel/workqueue.c:5693: WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long)); mm/slab.c:155:#define REDZONE_ALIGN max(BYTES_PER_WORD, __alignof__(unsigned long long)) mm/slab.c:2034: if (ralign > __alignof__(unsigned long long)) ----------------------------------->8-----------------------------
Not really sure how important is "kernel/workqueue.c" part but in case of "mm/slab.c" shouldn't we use ARCH_SLAB_MINALIGN there instead of that "not very meaningful" __alignof__(long long)?
-Alexey
On Thu, Feb 14, 2019 at 08:50:43AM +0000, Alexey Brodkin wrote:
But that's pretty much the same for other 32-bit arches that have 64-bit atomics like ARM etc. From what I may see from ARM's documentation for LDREXD/SRREXD they require double-word alignment of data as well.
That said if for some reason atomic64_t variable is unaligned execution on any (or at least most) 32-bit architectures will lead to run-time failure, i.e. we'll know about it and this will be fixed.
On x86_32 we have cmpxchg8b that 'likes' 8b alignment, our atomic64_32 implementation has the explicit alignment in, however there __alignof__(unsigned long long) is actually 8, so it all works.
Even though the hardware (obviously) never really requires alignment, even for atomic ops (although that's coming, yay!).
On 2/14/19 12:50 AM, Alexey Brodkin wrote:
I suspect the slab allocator should be returning 8 byte aligned addresses on all systems....
why ? As I understand it is still not fool proof against the expected alignment of inner members. There ought to be a better way to enforce all this.
I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.
This issue aside, are there other reasons ? Because making it 8 on ARC is just pending the eventuality for later.
But that's pretty much the same for other 32-bit arches that have 64-bit atomics like ARM etc. From what I may see from ARM's documentation for LDREXD/SRREXD they require double-word alignment of data as well.
Right LLOCKD/SCONDD (64-bit exclusive load/store) needs 64-bit aligned effective addresses for micro-arch reasons (1 vs 2 cache lines) etc.
So lets try to unpack this for me. Say we had.
struct foo { int a; atomic64_t b; };
The atomic64_t (which for ARC and most others is u64 __attribute__((aligned(8)) *already ensures* that there a 4 b padding is generated by gcc (I just confirmed with a simple test case).
#ifdef DOALIGN__ #define my_u64 __u64 __attribute__((aligned(8))) #else #define my_u64 __u64 #endif
struct foo on_heap;
printf(%d", &on_heap.b)
$ arc-linux-gcc -O2 test.c -DDOALIGN__ -c --save-temps
main: mov_s r1,@on_heap+8 <---- mov_s r0,@.LC0 b @printf
W/o the alignment attribute (say normal LDD/STD)
$ arc-linux-gcc -O2 test.c -c --save-temps
main: mov_s r1,@on_heap+4 mov_s r0,@.LC0 b @printf
So indeed your patch aligns dynamic structs to 64-bit, ensuring any embedded aligned_u64 to be 64-bit aligned as well. Phew !
That said if for some reason atomic64_t variable is unaligned execution on any (or at least most) 32-bit architectures will lead to run-time failure, i.e. we'll know about it and this will be fixed.
And what I'm doing by that change (ARCH_SLAB_MINALIGN=8 for ARC) I'm just working-around peculiarity of ARC ABI.
Right.
Out of curiosity I checked if there're any other occurrences of "alingof(long long)" and there seems to be a couple of more: ----------------------------------->8----------------------------- # git grep alignof | grep "long long"
...
kernel/workqueue.c:5693: WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long long)); mm/slab.c:155:#define REDZONE_ALIGN max(BYTES_PER_WORD, __alignof__(unsigned long long))
For ARC, it will be max(4,4) so 4 for others 32-bit,it will be max(4,8)
So indeed it makes sense to change it.
mm/slab.c:2034: if (ralign > __alignof__(unsigned long long)) ----------------------------------->8-----------------------------
Not really sure how important is "kernel/workqueue.c" part but in case of "mm/slab.c" shouldn't we use ARCH_SLAB_MINALIGN there instead of that "not very meaningful" __alignof__(long long)?
Hi Vineet,
-----Original Message----- From: Vineet Gupta vgupta@synopsys.com Sent: Friday, February 15, 2019 4:34 AM To: Alexey Brodkin alexey.brodkin@synopsys.com; Peter Zijlstra peterz@infradead.org Cc: David Laight David.Laight@ACULAB.COM; linux-snps-arc@lists.infradead.org; Arnd Bergmann arnd.bergmann@linaro.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org; Mark Rutland mark.rutland@arm.com Subject: Re: [PATCH] ARC: Explicitly set ARCH_SLAB_MINALIGN = 8
On 2/14/19 12:50 AM, Alexey Brodkin wrote:
I suspect the slab allocator should be returning 8 byte aligned addresses on all systems....
why ? As I understand it is still not fool proof against the expected alignment of inner members. There ought to be a better way to enforce all this.
I agree that for ARC ARCH_SLAB_MINALIGN should be at least 8.
This issue aside, are there other reasons ? Because making it 8 on ARC is just pending the eventuality for later.
But that's pretty much the same for other 32-bit arches that have 64-bit atomics like ARM etc. From what I may see from ARM's documentation for LDREXD/SRREXD they require double-word alignment of data as well.
Right LLOCKD/SCONDD (64-bit exclusive load/store) needs 64-bit aligned effective addresses for micro-arch reasons (1 vs 2 cache lines) etc.
So lets try to unpack this for me. Say we had.
struct foo { int a; atomic64_t b; };
The atomic64_t (which for ARC and most others is u64 __attribute__((aligned(8)) *already ensures* that there a 4 b padding is generated by gcc (I just confirmed with a simple test case).
#ifdef DOALIGN__ #define my_u64 __u64 __attribute__((aligned(8))) #else #define my_u64 __u64 #endif
struct foo on_heap;
printf(%d", &on_heap.b)
$ arc-linux-gcc -O2 test.c -DDOALIGN__ -c --save-temps
main: mov_s r1,@on_heap+8 <---- mov_s r0,@.LC0 b @printf
W/o the alignment attribute (say normal LDD/STD)
$ arc-linux-gcc -O2 test.c -c --save-temps
main: mov_s r1,@on_heap+4 mov_s r0,@.LC0 b @printf
So indeed your patch aligns dynamic structs to 64-bit, ensuring any embedded aligned_u64 to be 64-bit aligned as well. Phew !
That said if for some reason atomic64_t variable is unaligned execution on any (or at least most) 32-bit architectures will lead to run-time failure, i.e. we'll know about it and this will be fixed.
And what I'm doing by that change (ARCH_SLAB_MINALIGN=8 for ARC) I'm just working-around peculiarity of ARC ABI.
Right.
So are you OK with this patch or something should be done before applying?
Out of curiosity I checked if there're any other occurrences of "alingof(long long)" and there seems to be a couple of more: ----------------------------------->8----------------------------- # git grep alignof | grep "long long"
...
kernel/workqueue.c:5693: WARN_ON(__alignof__(struct pool_workqueue) < __alignof__(long
long));
mm/slab.c:155:#define REDZONE_ALIGN max(BYTES_PER_WORD, __alignof__(unsigned long long))
For ARC, it will be max(4,4) so 4 for others 32-bit,it will be max(4,8)
So indeed it makes sense to change it.
I guess that's still a separate change for generic code, right? I.e. I'll do it in a separate patch.
-Alexey
On 2/18/19 12:53 AM, Alexey Brodkin wrote:
So are you OK with this patch or something should be done before applying?
Or your original patch was
+#ifdef CONFIG_ARC_HAS_LL64 +#define ARCH_SLAB_MINALIGN 8 +#endif
I don't think this issue is related to LL64 at all. A long long on ARC could be 4 byte aligned (as mandated by the ABI). It could use LDD or 2 separate LDs. This has no bearing on the fact that LLOCKD have to use a 64-bit aligned addresses. So I think we should just do this unconditionally. Agree ?
On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
On 2/13/19 4:56 AM, Peter Zijlstra wrote:
Personally I think u64 and company should already force natural alignment; but alas.
But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is allowed to take a 32-bit aligned address to load a register pair. Thus all u64 need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation in ABI (alignment of long long is 4). You could certainly argue that we end up undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but still...
So what happens if the data is then split across two cachelines; will a STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that for > sizeof(unsigned long), with the obvious exception of atomic64_t, but yuck...
So even though it is allowed by the chip; does it really make sense to use this?
Hi Peter,
-----Original Message----- From: Peter Zijlstra peterz@infradead.org Sent: Thursday, February 14, 2019 1:32 PM To: Vineet Gupta vineet.gupta1@synopsys.com Cc: David Laight David.Laight@ACULAB.COM; Alexey Brodkin alexey.brodkin@synopsys.com; linux-snps- arc@lists.infradead.org; Arnd Bergmann arnd.bergmann@linaro.org; linux-kernel@vger.kernel.org; stable@vger.kernel.org; Mark Rutland mark.rutland@arm.com Subject: Re: [PATCH] ARC: Explicitly set ARCH_SLAB_MINALIGN = 8
On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
On 2/13/19 4:56 AM, Peter Zijlstra wrote:
Personally I think u64 and company should already force natural alignment; but alas.
But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is allowed to take a 32-bit aligned address to load a register pair. Thus all u64 need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation in ABI (alignment of long long is 4). You could certainly argue that we end up undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but still...
So what happens if the data is then split across two cachelines; will a STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that for > sizeof(unsigned long), with the obvious exception of atomic64_t, but yuck...
STD & LDD are simple store/load instructions so there's no problem for their 64-bit data to be from 2 subsequent cache lines as well as 2 pages (if we're that unlucky). Or you mean something else?
So even though it is allowed by the chip; does it really make sense to use this?
It gives performance benefits when dealing with either 64-bit or even larger buffers, see how we use it in our string routines like here [1].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch...
-Alexey
On Thu, Feb 14, 2019 at 10:44:49AM +0000, Alexey Brodkin wrote:
On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
On 2/13/19 4:56 AM, Peter Zijlstra wrote:
Personally I think u64 and company should already force natural alignment; but alas.
But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is allowed to take a 32-bit aligned address to load a register pair. Thus all u64 need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation in ABI (alignment of long long is 4). You could certainly argue that we end up undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but still...
So what happens if the data is then split across two cachelines; will a STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that for > sizeof(unsigned long), with the obvious exception of atomic64_t, but yuck...
STD & LDD are simple store/load instructions so there's no problem for their 64-bit data to be from 2 subsequent cache lines as well as 2 pages (if we're that unlucky). Or you mean something else?
u64 x;
WRITE_ONCE(x, 0x1111111100000000); WRITE_ONCE(x, 0x0000000011111111);
vs
t = READ_ONCE(x);
is t allowed to be 0x1111111111111111 ?
If the data is split between two cachelines, the hardware must do something very funny to avoid that.
single-copy-atomicity requires that to never happen; IOW no load or store tearing. You must observe 'whole' values, no mixing.
Linux requires READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for <=sizeof(unsigned long) and atomic*_read()/atomic*_set() for all atomic types. Your atomic64_t alignment should ensure this is so.
So while I think we're fine, I do find hardware instructions that tear yuck (yah, I know, x86...)
So even though it is allowed by the chip; does it really make sense to use this?
It gives performance benefits when dealing with either 64-bit or even larger buffers, see how we use it in our string routines like here [1].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch...
That doesn't require the ABI alignment crud.
Hi Peter,
-----Original Message----- From: linux-snps-arc linux-snps-arc-bounces@lists.infradead.org On Behalf Of Peter Zijlstra Sent: Thursday, February 14, 2019 2:08 PM To: Alexey Brodkin alexey.brodkin@synopsys.com Cc: Mark Rutland mark.rutland@arm.com; Vineet Gupta vineet.gupta1@synopsys.com; linux- kernel@vger.kernel.org; stable@vger.kernel.org; David Laight David.Laight@ACULAB.COM; Arnd Bergmann arnd.bergmann@linaro.org; linux-snps-arc@lists.infradead.org Subject: Re: [PATCH] ARC: Explicitly set ARCH_SLAB_MINALIGN = 8
On Thu, Feb 14, 2019 at 10:44:49AM +0000, Alexey Brodkin wrote:
On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
On 2/13/19 4:56 AM, Peter Zijlstra wrote:
Personally I think u64 and company should already force natural alignment; but alas.
But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is allowed to take a 32-bit aligned address to load a register pair. Thus all u64 need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation in ABI (alignment of long long is 4). You could certainly argue that we end up undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but still...
So what happens if the data is then split across two cachelines; will a STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that for > sizeof(unsigned long), with the obvious exception of atomic64_t, but yuck...
STD & LDD are simple store/load instructions so there's no problem for their 64-bit data to be from 2 subsequent cache lines as well as 2 pages (if we're that unlucky). Or you mean something else?
u64 x;
WRITE_ONCE(x, 0x1111111100000000); WRITE_ONCE(x, 0x0000000011111111);
vs
t = READ_ONCE(x);
is t allowed to be 0x1111111111111111 ?
If the data is split between two cachelines, the hardware must do something very funny to avoid that.
single-copy-atomicity requires that to never happen; IOW no load or store tearing. You must observe 'whole' values, no mixing.
Linux requires READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for <=sizeof(unsigned long) and atomic*_read()/atomic*_set() for all atomic types. Your atomic64_t alignment should ensure this is so.
Thanks for explanation!
I'm not completely sure about single-copy-atomic for our LDD/STD instructions (need to check with HW guys) but given above requirement: ---------------------------->8-------------------------- READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for <=sizeof(unsigned long) ---------------------------->8-------------------------- that's OK for them (LDD/STD) to not follow this, right? As they are obviously longer than "unsigned long".
Though I'm wondering if READ_ONCE()/WRITE_ONCE() could be used on 64-bit data even on 32-bit arches?
Now as for LLOCKD/SCONDD which implement single instruction 64-bit atomics require double-word alignment and so cannot possible span between cache lines.
So what am I missing here?
So while I think we're fine, I do find hardware instructions that tear yuck (yah, I know, x86...)
So even though it is allowed by the chip; does it really make sense to use this?
It gives performance benefits when dealing with either 64-bit or even larger buffers, see how we use it in our string routines like here [1].
3A__git.kernel.org_pub_scm_linux_kernel_git_torvalds_linux.git_tree_arch_arc_lib_memset-2Darchs.S- 23n81&d=DwICAg&c=DPL6_X_6JkXFx7AXWqB0tg&r=lqdeeSSEes0GFDDl656eViXO7breS55ytWkhpk5R81I&m=m60hCzPFQMtxeg 9HR5zZOJcRFMs6WLFJNSc6TNDqd4Y&s=Tapp7zbAmYYaTIaO5yKM0yUKfnaURFxdr56TS-JappQ&e=
That doesn't require the ABI alignment crud.
I'm not saying it has something to do with our ABI - that's just how we use it.
-Alexey
On Thu, Feb 14, 2019 at 12:05:47PM +0000, Alexey Brodkin wrote:
So what happens if the data is then split across two cachelines; will a STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that for > sizeof(unsigned long), with the obvious exception of atomic64_t, but yuck...
STD & LDD are simple store/load instructions so there's no problem for their 64-bit data to be from 2 subsequent cache lines as well as 2 pages (if we're that unlucky). Or you mean something else?
u64 x;
WRITE_ONCE(x, 0x1111111100000000); WRITE_ONCE(x, 0x0000000011111111);
vs
t = READ_ONCE(x);
is t allowed to be 0x1111111111111111 ?
If the data is split between two cachelines, the hardware must do something very funny to avoid that.
single-copy-atomicity requires that to never happen; IOW no load or store tearing. You must observe 'whole' values, no mixing.
Linux requires READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for <=sizeof(unsigned long) and atomic*_read()/atomic*_set() for all atomic types. Your atomic64_t alignment should ensure this is so.
Thanks for explanation!
I'm not completely sure about single-copy-atomic for our LDD/STD instructions (need to check with HW guys) but given above requirement: ---------------------------->8-------------------------- READ_ONCE()/WRITE_ONCE() to be single-copy-atomic for <=sizeof(unsigned long) ---------------------------->8-------------------------- that's OK for them (LDD/STD) to not follow this, right? As they are obviously longer than "unsigned long".
Correct.
Though I'm wondering if READ_ONCE()/WRITE_ONCE() could be used on 64-bit data even on 32-bit arches?
Some people were talking about just that a while ago; there's a number of 32bit platforms (including arm-v7) that can actually do this and there would be some performance benefits.
But this is currently not done (in generic code) afaik. And once they do this, it would be under some CONFIG knob.
In particular, things like:
- u64_stats_sync.h - most CONFIG_64BIT chunks from kernel/sched/fair.c
and ISTR there were some other cases.
Now as for LLOCKD/SCONDD which implement single instruction 64-bit atomics require double-word alignment and so cannot possible span between cache lines.
So what am I missing here?
atomic64_{set,read}() will use STD/LDD resp and should be single-copy-atomic vs the LLOCKD/SCONDD. But that again is aligned properly so I don't see problems there.
Aside of that just me being curious and paranoid at the same time.
From: Peter Zijlstra
Sent: 14 February 2019 11:08 On Thu, Feb 14, 2019 at 10:44:49AM +0000, Alexey Brodkin wrote:
On Wed, Feb 13, 2019 at 03:23:36PM -0800, Vineet Gupta wrote:
On 2/13/19 4:56 AM, Peter Zijlstra wrote:
Personally I think u64 and company should already force natural alignment; but alas.
But there is an ISA/ABI angle here too. e.g. On 32-bit ARC, LDD (load double) is allowed to take a 32-bit aligned address to load a register pair. Thus all u64 need not be 64-bit aligned (unless attribute aligned 8 etc) hence the relaxation in ABI (alignment of long long is 4). You could certainly argue that we end up undoing some of it anyways by defining things like ARCH_KMALLOC_MINALIGN to 8, but still...
So what happens if the data is then split across two cachelines; will a STD vs LDD still be single-copy-atomic? I don't _think_ we rely on that for > sizeof(unsigned long), with the obvious exception of atomic64_t, but yuck...
STD & LDD are simple store/load instructions so there's no problem for their 64-bit data to be from 2 subsequent cache lines as well as 2 pages (if we're that unlucky). Or you mean something else?
u64 x;
WRITE_ONCE(x, 0x1111111100000000); WRITE_ONCE(x, 0x0000000011111111);
vs
t = READ_ONCE(x);
is t allowed to be 0x1111111111111111 ?
If the data is split between two cachelines, the hardware must do something very funny to avoid that.
Never mind cachelines, think about separate pages. You'd have to 'lock' both TLB before doing either access. Not to mention the fact that the two physical locations could be on entirely different memory cards (or whatever).
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
linux-stable-mirror@lists.linaro.org