On 6 November 2012 02:48, Rob Herring <robherring2(a)gmail.com> wrote:
>
> On 11/05/2012 05:13 AM, Russell King - ARM Linux wrote:
> > On Mon, Nov 05, 2012 at 10:48:50AM +0000, Dave Martin wrote:
> >> On Thu, Oct 25, 2012 at 05:08:16PM +0200, Johannes Stezenbach wrote:
> >>> On Thu, Oct 25, 2012 at 09:25:06AM -0500, Rob Herring wrote:
> >>>> On 10/25/2012 09:16 AM, Johannes Stezenbach wrote:
> >>>>> On Thu, Oct 25, 2012 at 07:41:45AM -0500, Rob Herring wrote:
> >>>>>> On 10/25/2012 04:34 AM, Johannes Stezenbach wrote:
> >>>>>>> On Thu, Oct 11, 2012 at 07:43:22AM -0500, Rob Herring wrote:
> >>>>>>>
> >>>>>>>> While v6 can support unaligned accesses, it is optional and current
> >>>>>>>> compilers won't emit unaligned accesses. So we don't clear the A bit for
> >>>>>>>> v6.
> >>>>>>>
> >>>>>>> not true according to the gcc changes page
> >>>>>>
> >>>>>> What are you going to believe: documentation or what the compiler
> >>>>>> emitted? At least for ubuntu/linaro 4.6.3 which has the unaligned access
> >>>>>> support backported and 4.7.2, unaligned accesses are emitted for v7
> >>>>>> only. I guess default here means it is the default unless you change the
> >>>>>> default in your build of gcc.
> >>>>>
> >>>>> Since ARMv6 can handle unaligned access in the same way as ARMv7
> >>>>> it seems a clear bug in gcc which might hopefully get fixed.
> >>>>> Thus in this case I think it is reasonable to follow the
> >>>>> gcc documentation, otherwise the code would break for ARMv6
> >>>>> when gcc gets fixed.
> >>>>
> >>>> But the compiler can't assume the state of the U bit. I think it is
> >>>> still legal on v6 to not support unaligned accesses, but on v7 it is
> >>>> required. All the standard v6 ARM cores support it, but I'm not sure
> >>>> about custom cores or if there are SOCs with buses that don't support
> >>>> unaligned accesses properly.
> >>>
> >>> Well, I read the "...since Linux version 2.6.28" comment
> >>> on the gcc changes page as meaning that they assume the
> >>> U-bit is set. (Although I'm not sure it really is???)
> >>
> >> Actually, the kernel checks the arch version and the U bit on boot,
> >> and chooses the appropriate setting for the A bit depending on the
> >> result. (See arch/arm/mm/alignment.c:alignment_init().)
> >
> > That is in the kernel itself, _after_ the decompressor has run. It is
> > not relevant to any discussion about the decompressor.
> >
> >> Currently, we depend on the CPU reset behaviour or firmware/
> >> bootloader to set the U bit for v6, but the behaviour should be
> >> correct either way, though unaligned accesses will obviously
> >> perform (much) better with U=1.
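[For reference, a minimal sketch (not the actual arch/arm/mm/alignment.c
code) of the kind of check being described: read SCTLR, look at the U bit,
and pick the A bit setting accordingly. SCTLR.A is bit 1 and SCTLR.U is
bit 22 on ARMv6/v7.]

#define SCTLR_A (1u << 1)   /* alignment fault checking */
#define SCTLR_U (1u << 22)  /* unaligned access support (ARMv6+) */

static inline unsigned int read_sctlr(void)
{
    unsigned int v;
    asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
    return v;
}

static inline void write_sctlr(unsigned int v)
{
    asm volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(v));
}

static void choose_alignment_handling(void)
{
    unsigned int sctlr = read_sctlr();

    if (sctlr & SCTLR_U)
        sctlr &= ~SCTLR_A;  /* U=1: let unaligned accesses through */
    else
        sctlr |= SCTLR_A;   /* U=0: fault so software can fix up */

    write_sctlr(sctlr);     /* (real code would follow with an ISB) */
}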
> >
> > Will someone _PLEASE_ address my initial comments against this patch
> > in light of the fact that it's now been proven _NOT_ to be just a V7
> > issue, rather than everyone seemingly burying their heads in the sand
> > over this.
>
> I tried adding -munaligned-access on a v6 build and still get byte
> accesses rather than unaligned word accesses. So this does seem to be a
> v7 only issue based on what gcc will currently produce. Copying Michael
> Hope who can hopefully provide some insight on why v6 unaligned accesses
> are not enabled.
This looks like a bug. Unaligned access is enabled for armv6 but
seems to only take effect for cores with Thumb-2. Here's a test case
with both an unaligned field access and an unaligned block copy:
struct foo
{
    char a;
    int b;
    struct
    {
        int x[3];
    } c;
} __attribute__((packed));

/* unaligned field access */
int bar(struct foo *p)
{
    return p->b;
}

/* unaligned block copy */
void baz(struct foo *p, struct foo *q)
{
    p->c = q->c;
}
With -march=armv7-a you get the correct unaligned accesses:
bar:
ldr r0, [r0, #1] @ unaligned @ 11 unaligned_loadsi/2 [length = 4]
bx lr @ 21 *arm_return [length = 12]
baz:
str r4, [sp, #-4]! @ 25 *push_multi [length = 4]
mov r2, r0 @ 2 *arm_movsi_vfp/1 [length = 4]
ldr r4, [r1, #5]! @ unaligned @ 9 unaligned_loadsi/2 [length = 4]
ldr ip, [r1, #4] @ unaligned @ 10 unaligned_loadsi/2 [length = 4]
ldr r1, [r1, #8] @ unaligned @ 11 unaligned_loadsi/2 [length = 4]
str r4, [r2, #5] @ unaligned @ 12 unaligned_storesi/2 [length = 4]
str ip, [r2, #9] @ unaligned @ 13 unaligned_storesi/2 [length = 4]
str r1, [r2, #13] @ unaligned @ 14 unaligned_storesi/2 [length = 4]
ldmfd sp!, {r4}
bx lr
With -march=armv6 you get a byte-by-byte field access and a correct
unaligned block copy:
bar:
ldrb r1, [r0, #2] @ zero_extendqisi2
ldrb r3, [r0, #1] @ zero_extendqisi2
ldrb r2, [r0, #3] @ zero_extendqisi2
ldrb r0, [r0, #4] @ zero_extendqisi2
orr r3, r3, r1, asl #8
orr r3, r3, r2, asl #16
orr r0, r3, r0, asl #24
bx lr
baz:
str r4, [sp, #-4]!
mov r2, r0
ldr r4, [r1, #5]! @ unaligned
ldr ip, [r1, #4] @ unaligned
ldr r1, [r1, #8] @ unaligned
str r4, [r2, #5] @ unaligned
str ip, [r2, #9] @ unaligned
str r1, [r2, #13] @ unaligned
ldmfd sp!, {r4}
bx lr
readelf -A shows that the compiler planned to use unaligned accesses in
both cases. My suspicion is that GCC is using the extv pattern to extract
the field from memory, and that pattern is only enabled for Thumb-2
capable cores.
I've logged PR55218. We'll discuss it at our next meeting.
-- Michael
[Short week: 3 days]
* looked at (but failed to reproduce) a hang in QEMU reported
by Christoffer when shutting down a KVM ARM guest using TUN/TAP
networking
* investigated LP:1084148 (segfault in qemu usermode) sufficiently
to diagnose it as probably another of qemu's "can't handle
multithreaded guest programs" bugs
* fixed some problems with QEMU's secondary CPU boot code which
were masked by errors in QEMU's GIC model but revealed by
real hardware (i.e. KVM); fixed the GIC model bugs as well
* investigated LP:955379 (cmake hangs under qemu-arm-static).
Tracked down to a race condition involving signal delivery,
the fix to which would require the significant redesign I
sketched out here a year or so ago:
http://lists.gnu.org/archive/html/qemu-devel/2011-12/msg00384.html
KVM blueprint progress tracker:
http://ex.seabright.co.nz/helpers/backlog?group_by=topic&colour_by=state&pr…
-- PMM
== Blueprints ==
                             Initial      Current      Actual
initial-aarch64-backport     31 Oct 2012  7 Dec 2012*
aarch64-baremetal-testing    31 Oct 2012  7 Dec 2012*
fix-gcc-multiarch-testing    31 Dec 2012  31 Dec 2012
backport-fma-intrinsic       31 Dec 2012  31 Dec 2012
fused-multiply-add-support   31 Dec 2012  31 Dec 2012
gcc-investigate-lra-for-arm  31 Dec 2012  31 Dec 2012
== Progress ==
* Admin
  * Interviewing
  * Preparation for taking over from Michael
* Investigate patches for literal pool layout bug
  * Applied
* PINGed triplet backport patches upstream
* Other bug issues
  * Including an issue running SPEC2K on x86 with recent trunk
  * And a 4.6 gcc-linaro only issue
== Next Week ==
* Start leading Toolchain team
* Run Hot/Cold partitioning benchmarks
  * Analyse ARM results
  * Run on x86_64 to see what actual benefit we could get
* initial-aarch64-backport & aarch64-baremetal-testing
  * Finish documentation
* gcc-investigate-lra-for-arm
  * Analyse benchmarks
* fix-gcc-multiarch-testing
  * Come up with a strawman proposal for updating the testsuite to handle
    testing with varying command-line options.
== Future ==
* backport-fma-intrinsic & fused-multiply-add-support
  * Backport patches once fix-gcc-multiarch-testing has been done.
== Planned Leave ==
* Monday 24 December - Monday 31 December
--
Matthew Gretton-Dann
Linaro Toolchain Working Group
matthew.gretton-dann(a)linaro.org
Hi,
I think I have identified some issues with the atomic builtins, but I would
like your advice.
For instance:
A: __atomic_store_n (addr, val, __ATOMIC_SEQ_CST);
gives the armv7 code:
DMB sy
STR r1, [r0]
DMB sy
but if I have understood correctly, the DMB instructions only ensure that
the code is sequentially consistent; they do not provide the atomicity for
which we have to use the LDREX/STREX instructions. Thus I think that the
code should be:
DMB sy
1: LDREX r2, [r0]
   STREX r3, r1, [r0]
   TEQ r3, #0
   BNE 1b
B: __atomic_load_n (addr, __ATOMIC_ACQUIRE);
gives the armv7 code:
DMB sy
LDR r0, [r0]
but the load-acquire semantics specify that all loads and stores appearing
in program order after the load-acquire will be observed after it, so the
DMB should come after the LDR, no?
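Here is a minimal test case for both cases (function names are just
illustrative); compiling it with something like -O2 -march=armv7-a -S
shows the sequences quoted above:

/* Case A: currently emits DMB sy; STR; DMB sy on ARMv7. */
void store_seq_cst(int *addr, int val)
{
    __atomic_store_n(addr, val, __ATOMIC_SEQ_CST);
}

/* Case B: currently emits DMB sy; LDR on ARMv7. */
int load_acquire_int(int *addr)
{
    return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
}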
--
Yvan
Hi,
I'm working on AArch64 support for libatomic-ops (part of the Boehm GC).
I mainly use GCC's __atomic builtins to do this, but in our 4.7 version
they don't use the load-acquire / store-release instructions now available
in the ARMv8 ISA. These instructions are used in mainline GCC
(in atomic.md), but not in their exclusive form; I understand that this may
be due to the performance penalty, but I would like your opinion on that
point, as I don't find the ARMv8 ISA really clear.
If we want to implement an atomic load acquire, is
LDAR x1, [x0]
sufficient, or do we have to write it like this:
L: LDAXR x1, [x0]
   STXR w2, x1, [x0]
   CBNZ w2, L
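For comparison, a minimal sketch of the wrapper I have in mind (the name
is illustrative, not the actual libatomic-ops entry point), which I would
expect to compile down to a single LDAR on a compiler that uses the ARMv8
acquire/release instructions:

#include <stdint.h>

/* Illustrative wrapper; expected to become a single "ldar" on an
   ARMv8 compiler that emits the acquire/release instructions. */
static inline uintptr_t my_load_acquire(const volatile uintptr_t *addr)
{
    return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
}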
Thanks
Yvan
All,
[Editorial: Michael & I discussed making what we do as a working
group more visible at Connect. One thing we discussed was making our
meeting minutes more visible by emailing actions out after each
meeting. This will be part of the job of the 'minuter' - a job I plan
to spread around as I am useless at it whilst also running a call -
more info on the Wiki:
https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings]
The minutes of the performance call held on 27 November 2012 can be found at:
https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2012-11-27
In summary, the actions from the meeting are:
* mgrettondann to split the LRA blueprint
* Christophe to update the Hot/Cold partitioning bugzilla
* mgrettondann to benchmark Hot/Cold partitioning
* Michael to log a ticket to improve reporting of benchmarks when the
runs complete.
* Ramana to log EEMBC failure with Hot/Cold partitioning into bugzilla.
* Christophe to backport bswap16 builtin, except for the testcase
which fails in one of our configurations (Thumb1 + hard FP ABI)
The next performance call will be on 11 December 2012 and the agenda
can be found at:
https://wiki.linaro.org/WorkingGroups/ToolChain/Meetings/2012-12-11
Thanks,
Matt
--
Matthew Gretton-Dann
Linaro Toolchain Working Group
matthew.gretton-dann(a)linaro.org
The Linaro Toolchain Working Group is pleased to announce the 2012.11
release of the Linaro Toolchain Binaries, a pre-built version of
Linaro GCC and Linaro GDB that runs on generic Linux or Windows and
targets the glibc Linaro Evaluation Build.
Uses include:
* Cross compiling ARM applications from your laptop
* Remote debugging
* Building the Linux kernel for your board
What's included:
* Linaro GCC 4.7 2012.11
* Linaro GDB 7.5 2012.09
* A statically linked gdbserver
* A system root
* Manuals under share/doc/
The system root contains the basic header files and libraries to link
your programs against.
The Linux version is supported on Ubuntu 10.04.3 and 12.04, Debian
6.0.2, Fedora 16, openSUSE 12.1, Red Hat Enterprise Linux Workstation
5.7 and later, and should run on any Linux Standard Base 3.0
compatible distribution. Please see the README about running on
x86_64 hosts.
The Windows version is supported on Windows XP Pro SP3, Windows Vista
Business SP2, and Windows 7 Pro SP1.
The binaries and build scripts are available from:
https://launchpad.net/linaro-toolchain-binaries/trunk/2012.11
Need help? Ask a question on https://ask.linaro.org/
Already on Launchpad? Submit a bug at
https://bugs.launchpad.net/linaro-toolchain-binaries
On IRC? See us on #linaro on Freenode.
Other ways that you can contact us or get involved are listed at
https://wiki.linaro.org/GettingInvolved.
Summary:
* Investigate shrink-wrap result.
* Prepare for Linaro toolchain binary release, script merge and aarch64 test.
Details:
1. Investigate the shrink-wrap result for the function Ray_In_Bound (a toy
example of the shrink-wrapping pattern is sketched after this list). By
default, the ARM/MIPS/PPC/X86 toolchains cannot shrink-wrap the function.
For ARM, there is a copy "r6 = r1" which blocks the optimization. By
hand-editing the assembly code, I got a ~3% performance improvement on the
453.povray benchmark.
2. Set up an AArch64 simulation environment by following
http://www.linaro.org/engineering/armv8.
3. Write scripts to collect branch cost performance data. It will take
weeks to get all the benchmark results.
4. Smoke test the Linaro toolchain binaries 2012.11 release.
5. Try exporting crosstool-ng trunk to a bzr project. bzr fast-import
always fails on Ubuntu 10.04, but it works on 12.04.
6. RM toolchain related work.
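For context on item 1, a toy illustration (do_work is a hypothetical
helper) of the pattern shrink-wrapping targets: the early-return path needs
no saved registers, so the prologue/epilogue can be sunk into the slow path
instead of being executed unconditionally. A stray register copy on the
fast path, like the "r6 = r1" above, can defeat this.

extern int do_work(int *data);   /* hypothetical helper */

int maybe_work(int x, int *data)
{
    if (x < 0)                /* fast path: no stack frame needed */
        return 0;
    return do_work(data);     /* prologue/epilogue only on this path
                                 once the function is shrink-wrapped */
}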
Plans:
* Collect performance data for branch cost tuning.
* Linaro binary toolchain 2012.11 release.
* Verify shrink-wrap bugs.
Best regards!
-Zhenqiang