On 17 December 2013 07:53, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Michael Hudson-Doyle michael.hudson@linaro.org writes:
Michael Hudson-Doyle michael.hudson@linaro.org writes:
Will Newton will.newton@linaro.org writes:
On 16 December 2013 03:36, Michael Hudson-Doyle michael.hudson@linaro.org wrote:
Michael Hudson-Doyle michael.hudson@linaro.org writes:
Aaah, you might be onto something there. I built myself a cross gcc-4.8 today and it appeared to compile things correctly (I didn't actually get to run it, but the objdump poking looked right) and I got a bit worried that this was all down to some cosmic ray / corruption when I first compiled it. But, the scripts I cargo culted just compile binutils from git tip, so if the bug is in binutils...
So I still don't know what's going on, exactly, but I have a debug build of binutils now and some clues. It still only happens on real hardware, not cross compiling on my laptop, but I think I have an idea as to why. This might be complete crack, but anyway.
I think it's to do with the order of things within the GOT.
When I cross compile, sort the relocations by address, then count up the number of relocations of each type, it looks like this:
  $ objdump -C -R build/linux2/*/mongo/base/counter_test | LC_ALL=C sort | cut -d' ' -f 2 | uniq -c
        4
      496 R_AARCH64_GLOB_DAT
        1 R_AARCH64_TLS_TPREL64
      103 R_AARCH64_GLOB_DAT
      305 R_AARCH64_JUMP_SLOT
       12 R_AARCH64_COPY
        1 RELOCATION
        2
In this case, the code and the relocation agree on where the thread local variable is.
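For anyone who wants the same summary without the shell, here is a minimal sketch in C of what the sort | cut | uniq -c pipeline above boils down to: given relocation type names already sorted by address, collapse consecutive identical names into runs with a count. The names reloc_run and count_runs are made up for this illustration and are not part of binutils.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One run of consecutive identical relocation types,
   i.e. one output line of `uniq -c`. Hypothetical helper type. */
struct reloc_run {
    const char *type;   /* e.g. "R_AARCH64_GLOB_DAT" */
    size_t count;       /* consecutive entries of that type */
};

/* Collapse an address-sorted list of relocation type names into runs.
   Returns the number of runs written; stops early once `cap` runs
   have been filled. */
static size_t count_runs(const char *const *types, size_t n,
                         struct reloc_run *out, size_t cap)
{
    size_t nruns = 0;

    assert(types != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (nruns > 0 && strcmp(out[nruns - 1].type, types[i]) == 0) {
            /* Same type as the previous entry: extend the current run. */
            out[nruns - 1].count++;
        } else {
            if (nruns == cap)
                break;
            /* Type changed: start a new run. */
            out[nruns].type = types[i];
            out[nruns].count = 1;
            nruns++;
        }
    }
    return nruns;
}
```

Because only adjacent entries are merged, the same type can show up in more than one run, which is exactly why R_AARCH64_GLOB_DAT appears twice in the output: the TLS relocation sits in the middle of the GLOB_DAT entries.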
When I compile natively, it looks like this:
  (t-mwhudson)ubuntu@arm64:~/src/mongo$ objdump -C -R build/linux2/*/mongo/base/counter_test | LC_ALL=C sort | cut -d' ' -f 2 | uniq -c
        4
      295 R_AARCH64_JUMP_SLOT
      496 R_AARCH64_GLOB_DAT
        1 R_AARCH64_TLS_TPREL64
      104 R_AARCH64_GLOB_DAT
       12 R_AARCH64_COPY
        1 RELOCATION
        2
And the code and the relocation disagree on where the thread local variable is -- by 298 * sizeof(void*). Which is almost (but I admit, not exactly) the number of JUMP_SLOTs that are, in this case, before the TLS variable in the GOT. When I compiled in a different way, there were only 160 JUMP_SLOTs before the TLS reloc, and the code and relocation disagreed by 163 slots.
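To make the arithmetic concrete, here is a small sketch in C using only the numbers quoted above. It assumes 8-byte GOT slots (sizeof(void*) on 64-bit AArch64); the helper names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one GOT slot is sizeof(void *) == 8 bytes on AArch64 LP64. */
#define GOT_SLOT_SIZE 8u

/* Byte distance that corresponds to a disagreement of `slots` GOT entries. */
static size_t slots_to_bytes(size_t slots)
{
    return slots * GOT_SLOT_SIZE;
}

/* Slots of disagreement that the JUMP_SLOT entries placed before the
   TLS relocation do not account for. */
static long unexplained_slots(long disagreement_slots, long jump_slots_before)
{
    assert(disagreement_slots >= jump_slots_before);
    return disagreement_slots - jump_slots_before;
}
```

With the figures from the two native builds, slots_to_bytes(298) is 2384 bytes, and unexplained_slots(298, 295) and unexplained_slots(163, 160) both come out as 3. The leftover being the same in both builds is what makes the "JUMP_SLOTs plus a small constant" reading plausible rather than a coincidence.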
So is it possible somehow that the GOT has these JUMP_SLOTs inserted into it after the relocation for the TLS has been written out? I don't really see how but maybe this rings a bell...
Indeed it does. ;-)
A similar issue was caused by commit 692e2b8bcdd8325ebfbe1daace87100d53d15ad6 (which adds ifunc support to the aarch64 ld backend) but was intended to be fixed by the rework of the same code in 1419bbe5712af8a43f1d63feb32687649563426d. However I was never actually able to reproduce the failure case (I saw binaries that were broken so I know it could happen) so the fix was somewhat speculative. Hence I am very interested in finding a reproducible case where this GOT entry misordering happens!
I'm possibly doing something wrong, but I've tried compiling the suspect binary with both binutils git tip and the commit before 692e2b8bc but both had the problem. So I guess it's something else, or I wasn't testing what I thought I was testing.
Argh, I wasn't testing what I thought I was testing... trying again.
Ah... found it! This is the code that determines the offset to patch into the code (elfnn-aarch64.c line 3845):
  value = (symbol_got_offset (input_bfd, h, r_symndx)
           + globals->root.sgot->output_section->vma
           + globals->root.sgot->output_section->output_offset);
and this is the code that determines the offset as written into the relocation (elfnn-aarch64.c line 4248):
  off = symbol_got_offset (input_bfd, h, r_symndx);
  ...
  rela.r_offset = globals->root.sgot->output_section->vma
    + globals->root.sgot->output_offset + off;
Can you see the difference? The former is "root.sgot->output_section->output_offset", the latter is "root.sgot->output_offset".
Yes, that does look a bit odd.
This suggests the rather obvious attached patch. I haven't tested this exact patch, but it's an obvious translation from a patch to 692e2b8bcdd8325ebfbe1daace87100d53d15ad6^ which does work. I also haven't tested the second hunk at all, but it seems plausible...
Thanks for your analysis, the fix does look plausible indeed. ;-)
Have you verified it fixes the problem you were seeing?
I'm about to disappear to sunnier climes for three weeks but I'll definitely look at it when I get back. I've added Marcus to CC in case he isn't reading this list.