Hi toolchain, kernel folks,
I'm seeing something interesting in the .got and .bss sections of arch/arm/boot/compressed/vmlinux, and really need your expertise to shed some light.
I have an uninitialized variable 'uart_base' defined in misc.c.
static unsigned long uart_base;
$ arm-linux-gnueabi-objdump -D arch/arm/boot/compressed/misc.o
Disassembly of section .bss:
00000000 <uart_base>:
   0:   00000000    andeq   r0, r0, r0

00000004 <__machine_arch_type>:
   4:   00000000    andeq   r0, r0, r0
[...]
And the section layout in the linker script looks like the following.
SECTIONS
{
  ...
  _etext = .;

  _got_start = .;
  .got : { *(.got) }
  _got_end = .;
  .got.plt : { *(.got.plt) }
  _edata = .;

  . = ALIGN(4);
  __bss_start = .;
  .bss : { *(.bss) }
  _end = .;
  ...
}
$ arm-linux-gnueabi-objdump -h arch/arm/boot/compressed/vmlinux
arch/arm/boot/compressed/vmlinux: file format elf32-littlearm
Sections:
Idx Name             Size      VMA       LMA       File off  Algn
  0 .text            001c474c  00000000  00000000  00008000  2**5
                     CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .got             00000028  001c474c  001c474c  001cc74c  2**2
                     CONTENTS, ALLOC, LOAD, DATA
  2 .got.plt         0000000c  001c4774  001c4774  001cc774  2**2
                     CONTENTS, ALLOC, LOAD, DATA
  3 .bss             00000020  001c4780  001c4780  001cc780  2**2
                     ALLOC
  4 .stack           00001000  001c47a0  001c47a0  001cc780  2**0
                     ALLOC
  5 .comment         0000002a  00000000  00000000  001cc780  2**0
                     CONTENTS, READONLY
  6 .ARM.attributes  00000031  00000000  00000000  001cc7aa  2**0
                     CONTENTS, READONLY
I'm able to see uart_base in vmlinux objdump ...
$ arm-linux-gnueabi-objdump -t arch/arm/boot/compressed/vmlinux
[relevant only, and sorted]
00000000 l    d  .text      00000000 .text
001c474c l    d  .got       00000000 .got
001c4774 l    d  .got.plt   00000000 .got.plt
001c4780 l    d  .bss       00000000 .bss

003968e4 g       *ABS*      00000000 _image_size
001c474c g       *ABS*      00000000 _got_start
001c4774 g       *ABS*      00000000 _got_end
001c4780 g       *ABS*      00000000 _edata

001c4780 g       *ABS*      00000000 __bss_start
001c4780 l     O .bss       00000004 uart_base
001c4798 g     O .bss       00000004 malloc_ptr
001c478c g     O .bss       00000004 output_ptr
001c479c g     O .bss       00000004 malloc_count
001c4794 g     O .bss       00000004 free_mem_end_ptr
001c4788 g     O .bss       00000004 output_data
001c4784 g     O .bss       00000004 __machine_arch_type
001c4790 g     O .bss       00000004 free_mem_ptr
001c47a0 g       *ABS*      00000000 _end
... but I can not see it in the zImage (all others in .bss seem still there).
$ xxd arch/arm/boot/zImage | tail
01c4740: 3ef1 1400 be52 9f58 e468 3900 4c47 1c00  >....R.X.h9.LG..
                                       ^_got_start (why is it?)
01c4750: 9847 1c00 1043 0000 8c47 1c00 9c47 1c00  .G...C...G...G..
         ^malloc_ptr         ^output_ptr
01c4760: 9447 1c00 080a 0000 8847 1c00 8447 1c00  .G.......G...G..
         ^free_mem_end_ptr             ^__machine_arch_type
01c4770: 9047 1c00 0000 0000 0000 0000 0000 0000  .G..............
         ^free_mem_ptr
The following is a run-time memdump at _got_start.
Before recalculation:
9056304C: 001C474C 001C4798 00004310 001C478C 001C479C 001C4794 00000A08 001C4788
          ^_got_start (why is it?)
9056306C: 001C4784 001C4790 00000000 00000000 00000000 EDFE0DD0 4C010000 38000000

After recalculation (for .bss entries, the delta is 9039EA50; for the others in .got, the delta is 9039E900):
9056304C: 9056304C 905631E8 903A2C10 905631DC 905631EC 905631E4 9039F308 905631D8
9056306C: 905631D4 905631E0 00000000 00000000 00000000 73FBC000 4C010000 38000000
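The two deltas in the dump can be reproduced with a small fixup loop. The following is only a sketch inferred from the numbers above, not the code of the actual patch: the function name fixup_got and its parameters are hypothetical, and the extra .bss shift of 0x150 is simply the difference between the two deltas (0x9039EA50 - 0x9039E900).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical GOT fixup pass. Each GOT entry holds a link-time address.
 * Entries pointing into [bss_start, bss_end) move by load_delta + bss_shift
 * (so that .bss ends up beyond whatever was appended to the image);
 * every other entry moves by load_delta only.
 */
static void fixup_got(uint32_t *got, size_t n_entries,
                      uint32_t bss_start, uint32_t bss_end,
                      uint32_t load_delta, uint32_t bss_shift)
{
    assert(bss_start <= bss_end);

    for (size_t i = 0; i < n_entries; i++) {
        uint32_t addr = got[i];

        if (addr >= bss_start && addr < bss_end)
            got[i] = addr + load_delta + bss_shift;  /* .bss reference */
        else
            got[i] = addr + load_delta;              /* code or GOT reference */
    }
}
```

Feeding it the ten "before" words with a .bss range of 0x001C4780..0x001C47A0 yields the "after" words shown above. Note that uart_base has no entry in this table, so a pass like this cannot move it.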
QUESTION: Where is the .bss section of uart_base?
Now, I remove the 'static' to have 'unsigned long uart_base', and dump the same stuff to compare.
$ arm-linux-gnueabi-objdump -D arch/arm/boot/compressed/misc.o
Disassembly of section .bss:
00000000 <__machine_arch_type>:
   0:   00000000    andeq   r0, r0, r0

00000004 <uart_base>:
   4:   00000000    andeq   r0, r0, r0
I'm able to see uart_base in vmlinux objdump ...
$ arm-linux-gnueabi-objdump -t arch/arm/boot/compressed/vmlinux
[relevant only, and sorted]
00000000 l    d  .text      00000000 .text
001c4720 l    d  .got       00000000 .got
001c474c l    d  .got.plt   00000000 .got.plt
001c4758 l    d  .bss       00000000 .bss

003968e4 g       *ABS*      00000000 _image_size
001c4720 g       *ABS*      00000000 _got_start
001c474c g       *ABS*      00000000 _got_end
001c4758 g       *ABS*      00000000 _edata

001c4758 g       *ABS*      00000000 __bss_start
001c475c g     O .bss       00000004 uart_base
001c4770 g     O .bss       00000004 malloc_ptr
001c4764 g     O .bss       00000004 output_ptr
001c4774 g     O .bss       00000004 malloc_count
001c476c g     O .bss       00000004 free_mem_end_ptr
001c4760 g     O .bss       00000004 output_data
001c4758 g     O .bss       00000004 __machine_arch_type
001c4768 g     O .bss       00000004 free_mem_ptr
001c4778 g       *ABS*      00000000 _end
... and I can also see it in the final zImage.
$ xxd arch/arm/boot/zImage | tail
01c4710: 221f f1b3 3ef1 1400 be52 9f58 e468 3900  "...>....R.X.h9.
01c4720: 5c47 1c00 2047 1c00 7047 1c00 e442 0000  \G.. G..pG...B..
         ^uart_base
01c4730: 6447 1c00 7447 1c00 6c47 1c00 140a 0000  dG..tG..lG......
01c4740: 6047 1c00 5847 1c00 6847 1c00 0000 0000  `G..XG..hG......
01c4750: 0000 0000 0000 0000                      ........
Sure enough, it's in the run-time memdump too.
Before recalculation:
90563020: 001C475C 001C4720 001C4770 000042E4 001C4764 001C4774 001C476C 00000A14
          ^uart_base
90563040: 001C4760 001C4758 001C4768 00000000 00000000 00000000 EDFE0DD0 4C010000

After recalculation:
90563020: 905631AC 90563020 905631C0 903A2BE4 905631B4 905631C4 905631BC 9039F314
90563040: 905631B0 905631A8 905631B8 00000000 00000000 00000000 EDFE0DD0 4C010000
So it looks like the non-static ('g') uninitialized variable sits in the .bss section just fine, while the static ('l') one is not there. Is this expected? How is the static one being addressed? Or, to put it another way, where is the offset for the static one stored?
Any info or comments are appreciated.
On Tue, Apr 19, 2011 at 06:13:01PM +0800, Shawn Guo wrote:
> Hi toolchain, kernel folks,
> I'm seeing something interesting in the .got and .bss sections of arch/arm/boot/compressed/vmlinux, and really need your expertise to shed some light.
> I have an uninitialized variable 'uart_base' defined in misc.c.
> static unsigned long uart_base;
> [...]
I think this is explained by position-independence and symbol preemption issues.
The boot/compressed stuff is built with -fpic to make it position-independent, but to GCC this also means "might get dynamically linked".
This means that if uart_base is a global symbol, the compiler/linker have to cope with allowing it to be overridden from another shared library at dynamic link time.
Here's the code:
$ objdump -tdr arch/arm/boot/compressed/misc.o
[...]
00000008 g O .bss 00000004 uart_base
[...]
Disassembly of section .text:
00000000 <putc>:
   0:   4b11        ldr   r3, [pc, #68]   ; (48 <putc+0x48>)
   2:   4a12        ldr   r2, [pc, #72]   ; (4c <putc+0x4c>)
   4:   447b        add   r3, pc
   6:   b430        push  {r4, r5}
   8:   5899        ldr   r1, [r3, r2]
[...]
  48:   00000040    .word 0x00000040
                    48: R_ARM_GOTPC   _GLOBAL_OFFSET_TABLE_
  4c:   00000000    .word 0x00000000
                    4c: R_ARM_GOT32   uart_base
[...]
As a side-effect, this causes the address of uart_base to appear in the GOT, since this is where the dynamic linker would patch the symbol address if overriding it with a symbol at another location.
Of course, for building the kernel this is all pointless because there will be no dynamic linking. But GCC has no concept of position-independent code in a non-dynamic-linking environment. GCC can be persuaded to optimise away most of the GOT references by passing -fvisibility=protected or -fvisibility=hidden.
If uart_base is _not_ global (as in the original code), it will never be preempted, since by definition only global symbols can ever be preempted during dynamic linking.
So the reference can be fixed up in a purely pc-relative way at link time, and the actual address of uart_base may not appear in the resulting image at all. Here's the generated code:
$ objdump -td arch/arm/boot/compressed/vmlinux
003bdb40 l O .bss 00000004 uart_base
[...]
00000700 <putc>:
 700:   4b0f        ldr   r3, [pc, #60]   ; (740 <putc+0x40>)
 702:   b410        push  {r4}
 704:   447b        add   r3, pc
[...]
740: 003bd438 .word 0x003bd438
That 0x3bd438 is the reference to uart_base; i.e., 0x3bdb40 - <address of the "add r3, pc" instruction> - 4
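The arithmetic can be checked with a couple of helper functions. This is an illustrative sketch only: the helper names are made up, and it assumes the Thumb rule that reading PC in "add r3, pc" yields the address of that instruction plus 4.

```c
#include <assert.h>
#include <stdint.h>

/* In Thumb state, PC reads as the current instruction's address plus 4. */
#define THUMB_PC_READ_OFFSET 4u

/* Literal-pool word the linker must emit so that
 * "literal + PC" at the add instruction equals target. */
static uint32_t pcrel_literal(uint32_t target, uint32_t add_insn_addr)
{
    assert(add_insn_addr % 2 == 0);  /* Thumb instructions are halfword aligned */
    return target - (add_insn_addr + THUMB_PC_READ_OFFSET);
}

/* Address the code actually computes at run time from that literal. */
static uint32_t pcrel_resolve(uint32_t literal, uint32_t add_insn_addr)
{
    return literal + add_insn_addr + THUMB_PC_READ_OFFSET;
}
```

pcrel_literal(0x3bdb40, 0x704) gives 0x3bd438, matching the .word at 0x740. If the whole image is loaded at a different base, pcrel_resolve() moves the target by the same amount as the code, which is why such a reference needs no fixup but also cannot be redirected independently of the code.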
If uart_base _is_ global but we also pass -fvisibility=hidden to the compiler, then the generated code is once again fully pc-relative, and the address of uart_base does not appear as a literal word in the resulting image.
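For reference, the three cases discussed above look like this at the source level. The variable and function names are invented for illustration; the visibility attribute is the per-declaration GCC counterpart of -fvisibility=hidden. How each access is actually compiled depends on the toolchain and flags, so the comments only restate the behaviour described in this mail.

```c
#include <assert.h>

/* Default visibility: with -fpic, the address is looked up through the GOT. */
unsigned long uart_base_global;

/* Internal linkage: cannot be preempted, so the access is pc-relative. */
static unsigned long uart_base_static;

/* External linkage, hidden visibility: not preemptible either. */
__attribute__((visibility("hidden"))) unsigned long uart_base_hidden;

void set_all(unsigned long v)
{
    uart_base_global = v;
    uart_base_static = v;
    uart_base_hidden = v;
}

int all_equal(unsigned long v)
{
    return uart_base_global == v &&
           uart_base_static == v &&
           uart_base_hidden == v;
}

/* The C-level behaviour is identical; only the generated addressing differs. */
void demo(void)
{
    set_all(0x1000UL);
    assert(all_equal(0x1000UL));
}
```

Compiling this with -fpic and inspecting it with objdump -dr should show which of the three references carry a GOT relocation on a given toolchain.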
Hopefully this explains what's going on, but what are you trying to achieve exactly?
Cheers ---Dave
On Tue, Apr 19, 2011 at 04:23:09PM +0100, Dave Martin wrote:
> Hopefully this explains what's going on, but what are you trying to achieve exactly?
Thanks a ton, Dave. It does explain what I'm seeing, and your explanation makes very good learning material.
I'm running into a problem with John Bonies' append-dtb-to-zImage patch: the header of the dtb gets overwritten by the uart_base value. John's patch does fix up the .bss entries in the .got to move them behind the dtb image. But as you explained, when uart_base is defined as static, its address is fixed up in a pc-relative way at link time, so John's patch does not help, and the write to uart_base at run time overwrites the dtb image.
What do you think is the right fix for this problem? Forbid the use of static uninitialized variables? I'm afraid not. Is it possible to fix up cases like uart_base here at run time?
On Wed, Apr 20, 2011 at 12:08:56AM +0800, Shawn Guo wrote:
> On Tue, Apr 19, 2011 at 04:23:09PM +0100, Dave Martin wrote:
> > Hopefully this explains what's going on, but what are you trying to achieve exactly?
> Thanks a ton, Dave. It does explain what I'm seeing, and your explanation makes very good learning material.
> I'm running into a problem with John Bonies' append-dtb-to-zImage patch: the header of the dtb gets overwritten by the uart_base value. John's patch does fix up the .bss entries in the .got to move them behind the dtb image. But as you explained, when uart_base is defined as static, its address is fixed up in a pc-relative way at link time, so John's patch does not help, and the write to uart_base at run time overwrites the dtb image.
> What do you think is the right fix for this problem? Forbid the use of static uninitialized variables? I'm afraid not. Is it possible to fix up cases like uart_base here at run time?
So, if I understand correctly, because .bss doesn't take space in the zImage, when the dtb is appended, it effectively ends up on top of the bss/stack area?
Since the compressed kernel loader knows how big bss and the stack are, maybe the early zImage boot code can move the dtb out of the way before touching the stack or zeroing bss -- basically, it sounds like we need to move the dtb to the end of the stack in order for it to be safe.
We also need to avoid the space used beyond the end of the stack for heap memory: in compressed/head.S, it looks like the maximum heap is 0x10000 bytes, starting at the end of the stack. Maybe it would be better to declare this heap space explicitly in the linker script or some .s file -- we can then define a label at the end of it in the linker script, instead of the magic number arithmetic which is done currently.
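To make the layout argument concrete, here is a rough sketch of the two checks involved. Everything in it is hypothetical (struct region, regions_overlap, dtb_safe_dest); it works on link-time addresses, takes the .bss and .stack sizes from the objdump -h output earlier in the thread, and uses the 0x10000 heap size mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A half-open address range [start, start + size). */
struct region {
    uint32_t start;
    uint32_t size;
};

/* True if the two ranges share at least one byte. */
static bool regions_overlap(struct region a, struct region b)
{
    return a.start < b.start + b.size && b.start < a.start + a.size;
}

/* First address beyond stack and heap, rounded up to a power-of-two
 * alignment: a place where a relocated dtb would be out of the way. */
static uint32_t dtb_safe_dest(uint32_t stack_end, uint32_t heap_size,
                              uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (stack_end + heap_size + align - 1) & ~(align - 1);
}
```

With .bss at 0x1c4780 (0x20 bytes) followed by a 0x1000-byte stack, a blob appended at _edata = 0x1c4780 overlaps the writable area, while dtb_safe_dest(0x1c57a0, 0x10000, 8) returns 0x1d57a0, just past the heap.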
Cheers ---Dave
On Tue, 19 Apr 2011, Dave Martin wrote:
> So, if I understand correctly, because .bss doesn't take space in the zImage, when the dtb is appended, it effectively ends up on top of the bss/stack area?
Yes. However...
> Since the compressed kernel loader knows how big bss and the stack are, maybe the early zImage boot code can move the dtb out of the way before touching the stack or zeroing bss -- basically, it sounds like we need to move the dtb to the end of the stack in order for it to be safe.
What the current code does is to leave the DTB in place and modify the GOT to move .bss references and the stack away instead. The XIPable zImage also relies on that "feature". See for example this line in arch/arm/boot/compressed/uncompress.c:
#define STATIC_RW_DATA /* non-static please */
It is also much cheaper to modify the GOT than to relocate the DTB. And given that some people would like to use the same trick with ramdisk images, this is even more significant.
Once upon a time the kernel decompressor was built with -Dstatic="" but that didn't work with the zlib code rewrite as this affects functions as well as const data.
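The macro approach can be sketched as follows. This is a simplified illustration rather than the kernel's actual code: bytes_out and count_output are invented names, and the #ifndef fallback stands in for whatever default the shared decompression code provides when the macro is not overridden.

```c
#include <assert.h>
#include <limits.h>

/* The decompressor defines the macro as empty before including shared code. */
#define STATIC_RW_DATA /* non-static please */

/* Shared code falls back to "static" only if nobody overrode the macro. */
#ifndef STATIC_RW_DATA
#define STATIC_RW_DATA static
#endif

/* Writable state: expands to a plain global here, so with -fpic its
 * address is reached through the GOT and can be patched at run time. */
STATIC_RW_DATA unsigned long bytes_out;

unsigned long count_output(unsigned long n)
{
    assert(n <= ULONG_MAX - bytes_out);  /* guard against wrap-around */
    bytes_out += n;
    return bytes_out;
}
```

Unlike -Dstatic="", this only touches declarations that opt in, so functions and const data keep their static qualifier.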
Nicolas
On Wed, 20 Apr 2011, Shawn Guo wrote:
> On Tue, Apr 19, 2011 at 04:23:09PM +0100, Dave Martin wrote:
> > Hopefully this explains what's going on, but what are you trying to achieve exactly?
> Thanks a ton, Dave. It does explain what I'm seeing, and your explanation makes very good learning material.
> I'm running into a problem with John Bonies' append-dtb-to-zImage patch: the header of the dtb gets overwritten by the uart_base value. John's patch does fix up the .bss entries in the .got to move them behind the dtb image. But as you explained, when uart_base is defined as static, its address is fixed up in a pc-relative way at link time, so John's patch does not help, and the write to uart_base at run time overwrites the dtb image.
> What do you think is the right fix for this problem? Forbid the use of static uninitialized variables? I'm afraid not. Is it possible to fix up cases like uart_base here at run time?
You must not use static variables in the decompressor. For one thing, that breaks the ability to XIP the decompressor code and move writable data elsewhere.
So the fix is indeed to _not_ declare any global variable as static in this case.
Nicolas
On Tue, Apr 19, 2011 at 01:33:12PM -0400, Nicolas Pitre wrote:
> On Wed, 20 Apr 2011, Shawn Guo wrote:
> > On Tue, Apr 19, 2011 at 04:23:09PM +0100, Dave Martin wrote:
> > > Hopefully this explains what's going on, but what are you trying to achieve exactly?
> > Thanks a ton, Dave. It does explain what I'm seeing, and your explanation makes very good learning material.
> > I'm running into a problem with John Bonies' append-dtb-to-zImage patch: the header of the dtb gets overwritten by the uart_base value. John's patch does fix up the .bss entries in the .got to move them behind the dtb image. But as you explained, when uart_base is defined as static, its address is fixed up in a pc-relative way at link time, so John's patch does not help, and the write to uart_base at run time overwrites the dtb image.
> > What do you think is the right fix for this problem? Forbid the use of static uninitialized variables? I'm afraid not. Is it possible to fix up cases like uart_base here at run time?
> You must not use static variables in the decompressor. For one thing, that breaks the ability to XIP the decompressor code and move writable data elsewhere.
> So the fix is indeed to _not_ declare any global variable as static in this case.
After some thinking about this, I think I agree.
Having to relocate a GOT-full of addresses, many of which are actually at fixed PC-relative offsets, just for this capability is a bit annoying, but the GNU tools don't support other models very well.
We might be able to reduce the size of the GOT by building with -fvisibility=hidden, and making judicious use of "extern" on all data declarations/definitions:
[gcc-4.4.info] `extern' declarations are not affected by `-fvisibility', so a lot of code can be recompiled with `-fvisibility=hidden' with no modifications. However, this means that calls to `extern' functions with no explicit visibility will use the PLT, so it is more effective to use `__attribute ((visibility))' and/or `#pragma GCC visibility' to tell the compiler which `extern' declarations should be treated as hidden.
This only seems to work reliably for data definitions; plus the toolchain behaviour may "evolve" with respect to obscure features like this. So if we wanted to achieve such a thing reliably, we'd probably need explicit visibility attributes on the affected declarations.
The advantage is unlikely to be huge though since the GOT is small anyway; and we wouldn't be able to throw away the GOT relocation code completely, because of the need to relocate bss references...
Cheers ---Dave
On Wed, 20 Apr 2011, Dave Martin wrote:
> On Tue, Apr 19, 2011 at 01:33:12PM -0400, Nicolas Pitre wrote:
> > You must not use static variables in the decompressor. For one thing, that breaks the ability to XIP the decompressor code and move writable data elsewhere.
> > So the fix is indeed to _not_ declare any global variable as static in this case.
> After some thinking about this, I think I agree.
> Having to relocate a GOT-full of addresses, many of which are actually at fixed PC-relative offsets, just for this capability is a bit annoying, but the GNU tools don't support other models very well.
You cannot relocate PC-relative offsets at run time. Those references are spread throughout the code into literal pools. Forcing all references to go through the GOT makes it possible for the code to relocate selected parts of itself at run time.
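A toy model may help show the difference. It is purely illustrative: memory is an array of words, the struct and function names are invented, and the two constants are the dtb magic word and the stray value seen in the memory dumps earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS 16

/* Toy image: one variable, reachable through a GOT slot or through an
 * index that was fixed at link time (the pc-relative case). */
struct image {
    uint32_t mem[MEM_WORDS];  /* toy RAM, indexed by word */
    uint32_t got_slot;        /* patchable: where the variable lives now */
    uint32_t fixed_index;     /* baked into the code, cannot be patched */
};

static void image_init(struct image *im, uint32_t bss_index)
{
    memset(im, 0, sizeof(*im));
    im->got_slot = bss_index;
    im->fixed_index = bss_index;
}

/* An appended blob lands on the linked .bss location; the loader then
 * redirects the GOT entry past it. Only the GOT entry can be changed. */
static void append_blob_and_fixup(struct image *im, uint32_t magic,
                                  uint32_t blob_words)
{
    assert(im->got_slot + blob_words < MEM_WORDS);
    im->mem[im->got_slot] = magic;
    im->got_slot += blob_words;
}

static void store_via_got(struct image *im, uint32_t v)
{
    im->mem[im->got_slot] = v;     /* follows the relocation */
}

static void store_pc_relative(struct image *im, uint32_t v)
{
    im->mem[im->fixed_index] = v;  /* still hits the old location */
}
```

After append_blob_and_fixup(), a store through the GOT leaves the blob intact, whereas store_pc_relative() overwrites its first word, which is the symptom reported for the static uart_base.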
> We might be able to reduce the size of the GOT by building with -fvisibility=hidden, and making judicious use of "extern" on all data declarations/definitions:
> [gcc-4.4.info] `extern' declarations are not affected by `-fvisibility', so a lot of code can be recompiled with `-fvisibility=hidden' with no modifications. However, this means that calls to `extern' functions with no explicit visibility will use the PLT, so it is more effective to use `__attribute ((visibility))' and/or `#pragma GCC visibility' to tell the compiler which `extern' declarations should be treated as hidden.
> This only seems to work reliably for data definitions; plus the toolchain behaviour may "evolve" with respect to obscure features like this.
That doesn't solve the problem at all. In this case, we really want _all_ data references to go through the GOT, meaning that everything would have to be marked extern. The only references which are OK to be PC relative are read-only references, and therefore they can just be marked as static const.
> So if we wanted to achieve such a thing reliably, we'd probably need explicit visibility attributes on the affected declarations.
Like I said, it's about all of them.
> The advantage is unlikely to be huge though since the GOT is small anyway; and we wouldn't be able to throw away the GOT relocation code completely, because of the need to relocate bss references...
In fact, all that remains in the GOT, assuming that const data is marked static, are .bss references. Again, for simplicity's sake, we don't support initialized and writable global variables as in the XIP case those would have to be copied into RAM and the GOT patched accordingly. In practice this is not hard to achieve. To ensure that, we simply discard the .data early in the linker script.
Nicolas
Hi,
On Wed, Apr 20, 2011 at 1:42 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Wed, 20 Apr 2011, Dave Martin wrote:
> > On Tue, Apr 19, 2011 at 01:33:12PM -0400, Nicolas Pitre wrote:
> > > You must not use static variables in the decompressor. For one thing, that breaks the ability to XIP the decompressor code and move writable data elsewhere.
> > > So the fix is indeed to _not_ declare any global variable as static in this case.
> > After some thinking about this, I think I agree.
> > Having to relocate a GOT-full of addresses, many of which are actually at fixed PC-relative offsets, just for this capability is a bit annoying, but the GNU tools don't support other models very well.
> You cannot relocate PC-relative offsets at run time. Those references are spread throughout the code into literal pools. Forcing all references to go through the GOT makes it possible for the code to relocate selected parts of itself at run time.
My point was that relocatability implies overhead, and the GOT potentially contains a load of relocations for code and read-only data which will never get moved in practice.
For writable/uninitialised data, it's different of course -- we often will need to relocate that in real situations (as observed here). I'd guessed that only part of the GOT in the compressed loader was addressing such data, but actually, it seems to be pretty much all of it, as you suggest.
So the number of useless relocations, and any associated overhead, looks low (if any).
> > We might be able to reduce the size of the GOT by building with -fvisibility=hidden, and making judicious use of "extern" on all data declarations/definitions:
> > [gcc-4.4.info] `extern' declarations are not affected by `-fvisibility', so a lot of code can be recompiled with `-fvisibility=hidden' with no modifications. However, this means that calls to `extern' functions with no explicit visibility will use the PLT, so it is more effective to use `__attribute ((visibility))' and/or `#pragma GCC visibility' to tell the compiler which `extern' declarations should be treated as hidden.
> > This only seems to work reliably for data definitions; plus the toolchain behaviour may "evolve" with respect to obscure features like this.
> That doesn't solve the problem at all. In this case, we really want _all_ data references to go through the GOT, meaning that everything would have to be marked extern. The only references which are OK to be PC relative are read-only references, and therefore they can just be marked as static const.
> > So if we wanted to achieve such a thing reliably, we'd probably need explicit visibility attributes on the affected declarations.
> Like I said, it's about all of them.
> > The advantage is unlikely to be huge though since the GOT is small anyway; and we wouldn't be able to throw away the GOT relocation code completely, because of the need to relocate bss references...
> In fact, all that remains in the GOT, assuming that const data is marked static, are .bss references. Again, for simplicity's sake, we don't support initialized and writable global variables as in the XIP case those would have to be copied into RAM and the GOT patched accordingly. In practice this is not hard to achieve. To ensure that, we simply discard the .data early in the linker script.
Sure -- my observations were simply based around the fact that we're using the tools to do something they don't feel well adapted to, compared with other tools with a more embedded/bare-metal focus. So if there were a better or more correct way to use the tools to get the results we need, it would be worth considering. But from the discussion it sounds like the code already does pretty much the best thing possible anyway.
Cheers ---Dave
On Wed, 20 Apr 2011, Dave Martin wrote:
> Hi,
> On Wed, Apr 20, 2011 at 1:42 PM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > On Wed, 20 Apr 2011, Dave Martin wrote:
> > > On Tue, Apr 19, 2011 at 01:33:12PM -0400, Nicolas Pitre wrote:
> > > > You must not use static variables in the decompressor. For one thing, that breaks the ability to XIP the decompressor code and move writable data elsewhere.
> > > > So the fix is indeed to _not_ declare any global variable as static in this case.
> > > After some thinking about this, I think I agree.
> > > Having to relocate a GOT-full of addresses, many of which are actually at fixed PC-relative offsets, just for this capability is a bit annoying, but the GNU tools don't support other models very well.
> > You cannot relocate PC-relative offsets at run time. Those references are spread throughout the code into literal pools. Forcing all references to go through the GOT makes it possible for the code to relocate selected parts of itself at run time.
> My point was that relocatability implies overhead, and the GOT potentially contains a load of relocations for code and read-only data which will never get moved in practice.
Sure, for code (already implicit) or ro data, using GOTOFF relocs is perfectly fine. As long as the relevant data is marked const then there is no issue also marking it static, at which point the same effect as -fvisibility=hidden is achieved i.e. no GOT entries are allocated.
> For writable/uninitialised data, it's different of course -- we often will need to relocate that in real situations (as observed here). I'd guessed that only part of the GOT in the compressed loader was addressing such data, but actually, it seems to be pretty much all of it, as you suggest.
Yes, and in practice it contains only between 6 and 8 entries depending on the config used. And all of them are references to .bss variables. So the overhead is pretty small.
Nicolas
linaro-toolchain@lists.linaro.org