The Linaro Toolchain Working Group is pleased to announce the release of both Linaro GCC 4.4 and Linaro GCC 4.5.
Linaro GCC 4.4 is the third release in the 4.4 series. Based off the latest GCC 4.4.4, it pulls in the pre-4.4.5 changes made by the FSF over the last six months.
Linaro GCC 4.5 is the second release in the 4.5 series. Based off the latest GCC 4.5.1, it finishes the merge of many ARM-focused performance improvements and bug fixes.
Interesting changes include:
* Improved performance on the Cortex-A9
* Backports of a range of performance improvements from mainline
* New inline versions of the GCC builtin sync primitives
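As a quick illustration of the last item, the builtin sync primitives are the __sync_* family; with the new inline support, calls like the ones in this minimal sketch should expand to ldrex/strex sequences on ARMv7 rather than out-of-line helper calls (exact code generation depends on the -march/-mcpu flags used):

/* Minimal sketch of the GCC __sync builtins on an int counter.
 * With inline expansion these become ldrex/strex loops on ARMv7
 * instead of calls to out-of-line helpers.
 */
#include <stdio.h>

static int counter;

int main(void)
{
    __sync_fetch_and_add(&counter, 1);                       /* atomic increment */
    int old = __sync_val_compare_and_swap(&counter, 1, 5);   /* compare-and-swap */
    printf("old=%d new=%d\n", old, counter);
    return 0;
}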
Downloads are available from the Linaro GCC page on Launchpad: https://launchpad.net/gcc-linaro
Also available is an early release of optimised string routines for the Cortex-A series, including a mix of NEON and Thumb-2 versions of memcpy(), memset(), strcpy(), strcmp(), and strlen(). For more information see: https://launchpad.net/cortex-strings
Pre-built packages are available in the Linaro Toolchain PPA at: https://launchpad.net/~linaro-toolchain-dev/+archive/ppa
-- Michael
Hi,
Also available is an early release of optimised string routines for the Cortex-A series, including a mix of NEON and Thumb-2 versions of memcpy(), memset(), strcpy(), strcmp(), and strlen(). For more information see: https://launchpad.net/cortex-strings
My understanding is that the NEON optimisation will give some performance gain *ONLY* on the Cortex-A8, but it will also burn more energy. On other CPUs, e.g. the Cortex-A9, there is no performance gain but it will still cost more energy. The Linaro toolchain doesn't target a specific platform but is generic for ARMv7 platforms. Are you expecting to see those optimisations turned on in the Linaro toolchain?
The NEON-optimised version is beneficial for large copies, but not for short copies when the NEON unit has to be powered up (the Linux kernel takes an exception to turn it on). I guess your benchmark didn't take that into account. Can the NEON-optimised version be changed so that it is not used for small copies?
Guillaume
On Wed, Sep 15, 2010 at 5:19 AM, Guillaume Letellier Guillaume.Letellier@arm.com wrote:
Hi,
Also available is an early release of optimised string routines for the Cortex-A series, including a mix of NEON and Thumb-2 versions of memcpy(), memset(), strcpy(), strcmp(), and strlen(). For more information see: https://launchpad.net/cortex-strings
My understanding is that the NEON optimisation will give some performance gain *ONLY* on the Cortex-A8, but it will also burn more energy. On other CPUs, e.g. the Cortex-A9, there is no performance gain but it will still cost more energy.
I've heard that too but never had it confirmed. I will ask. The output of this project will be a set of routines specialised for Thumb-2, NEON, Cortex-A8, and Cortex-A9, where there is a benefit in doing variants for each. We need good non-NEON versions as NEON is optional and it can't be used in the Linux kernel.
The Linaro toolchain doesn't target a specific platform but is generic for ARMv7 platforms. Are you expecting to see those optimisations turned on in the Linaro toolchain?
Sorry, I don't understand the question. We want to spread these routines out and get them integrated into all of the upstream C libraries including NewLib, Bionic, and GLIBC.
The NEON-optimised version is beneficial for large copies, but not for short copies when the NEON unit has to be powered up (the Linux kernel takes an exception to turn it on). I guess your benchmark didn't take that into account. Can the NEON-optimised version be changed so that it is not used for small copies?
My understanding is that the NEON unit is on per process, so once you've turned it on it should stay on. I assume the turn on cost is amortised across a run. Note that if the data is not in the L1 cache then the NEON unit wins even for small-ish (~64 byte) copies.
-- Michael
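To make the cache-residency point concrete, here is a rough microbenchmark sketch (not the cortex-strings test harness; the buffer sizes and iteration counts are arbitrary). It times small copies of L1-resident data against small copies that stride through a buffer much larger than the caches, so most of them miss:

/* Rough sketch: warm vs cold-ish 64-byte memcpy calls.
 * Link with -lrt on older glibc for clock_gettime().
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COPY_SIZE  64                    /* the "small-ish" size discussed above */
#define BIG_BYTES  (64 * 1024 * 1024)    /* much larger than L1/L2 */
#define ITERATIONS 100000

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    static char small_src[COPY_SIZE], small_dst[COPY_SIZE];
    char *big = malloc(BIG_BYTES);
    double t0, warm, cold;
    int i;

    if (!big)
        return 1;
    memset(big, 1, BIG_BYTES);

    /* Warm case: source and destination stay resident in L1. */
    t0 = seconds();
    for (i = 0; i < ITERATIONS; i++)
        memcpy(small_dst, small_src, COPY_SIZE);
    warm = seconds() - t0;

    /* Cold-ish case: stride through the big buffer so most copies miss. */
    t0 = seconds();
    for (i = 0; i < ITERATIONS; i++)
        memcpy(small_dst, big + ((size_t)i * 4096) % (BIG_BYTES - COPY_SIZE),
               COPY_SIZE);
    cold = seconds() - t0;

    printf("warm: %.4fs  cold-ish: %.4fs for %d copies of %d bytes\n",
           warm, cold, ITERATIONS, COPY_SIZE);
    free(big);
    return 0;
}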
The Linaro toolchain doesn't target a specific platform but is generic for ARMv7 platforms. Are you expecting to see those optimisations turned on in the Linaro toolchain?
Sorry, I don't understand the question. We want to spread these routines out and get them integrated into all of the upstream C libraries including NewLib, Bionic, and GLIBC.
My concern is that you want to spread it too widely! If the NEON-optimised memcpy() goes into GLIBC then I assume it will be used on any ARMv7 platform (unless I'm mistaken, you don't have a mechanism to detect whether GLIBC is running on a Cortex-A8 or an A9, and you don't have two different versions of the GLIBC library for the two CPUs). So this library might be good for the A8 but not for the other CPUs.
My understanding is that the NEON unit is on per process, so once you've turned it on it should stay on.
It's turned off by the kernel at context switch. For threads dealing with a lot of data, it makes sense. Turning on NEON for a small copy doesn't make sense on embedded platforms.
I assume the turn on cost is amortised across a run. Note that if the data is not in the L1 cache then the NEON unit wins even for small-ish (~64 byte) copies.
Only on Cortex-A8. But still expensive power-wise.
Guillaume
On Wed, Sep 15, 2010 at 10:25 AM, Guillaume Letellier Guillaume.Letellier@arm.com wrote:
The Linaro toolchain doesn't target a specific platform but is generic for ARMv7 platforms. Are you expecting to see those optimisations turned on in the Linaro toolchain?
Sorry, I don't understand the question. We want to spread these routines out and get them integrated into all of the upstream C libraries including NewLib, Bionic, and GLIBC.
My concern is that you want to spread it too widely! If the NEON-optimised memcpy() goes into GLIBC then I assume it will be used on any ARMv7 platform (unless I'm mistaken, you don't have a mechanism to detect whether GLIBC is running on a Cortex-A8 or an A9, and you don't have two different versions of the GLIBC library for the two CPUs). So this library might be good for the A8 but not for the other CPUs.
GLIBC has a mechanism for picking the best routines to use based on the CPU capabilities. This means that GLIBC can include A8 and A9 versions both with and without NEON, Ubuntu can ship all of these versions, and the dynamic linker can choose the best one based on the chip it is running on.
NewLib and Bionic are set at compile time but are normally used on a fixed platform.
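As an illustration of the capability side only (this is not glibc's internal selection code), a process can read AT_HWCAP from /proc/self/auxv and test the NEON bit. The HWCAP_NEON value below is the 32-bit ARM definition from asm/hwcap.h, repeated here as an assumption:

/* Sketch: check whether the kernel reports NEON in the ELF hwcaps.
 * This only says NEON exists; it does not distinguish an A8 from an
 * A9, which is exactly the limitation raised in this thread.
 * Assumes a 32-bit ARM process.
 */
#include <elf.h>
#include <stdio.h>

#ifndef HWCAP_NEON
#define HWCAP_NEON (1 << 12)    /* ARM value from <asm/hwcap.h>, assumed here */
#endif

int main(void)
{
    FILE *f = fopen("/proc/self/auxv", "rb");
    Elf32_auxv_t entry;
    unsigned long hwcaps = 0;

    if (!f)
        return 1;
    while (fread(&entry, sizeof(entry), 1, f) == 1) {
        if (entry.a_type == AT_HWCAP) {
            hwcaps = entry.a_un.a_val;
            break;
        }
    }
    fclose(f);

    printf("NEON %s\n", (hwcaps & HWCAP_NEON) ? "present" : "absent");
    return 0;
}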
I assume the turn on cost is amortised across a run. Note that if the data is not in the L1 cache then the NEON unit wins even for small-ish (~64 byte) copies.
Only on Cortex-A8. But still expensive power-wise.
Could you point me at a reference on the power consumption of the NEON unit? It will take power, but I don't know how much or how significant it is.
-- Michael
GLIBC has a mechanism for picking the best routines to use based on the CPU capabilities. This means that GLIBC can include A8 and A9 versions both with and without NEON, Ubuntu can ship all of these versions, and the dynamic linker can choose the best one based on the chip it is running on.
I've had a quick look at HWCAP in eglibc and I didn't see it used anywhere. Anyway, you can detect whether NEON is supported or not, but the CPU (A8 vs A9) is not detected. Therefore you can detect that a Cortex-A9 platform supports NEON, but that doesn't mean the NEON memcpy is the best solution (actually the performance is not better and the power consumption is worse!).
Could you point me at a reference on the power consumption of the NEON unit? It will take power, but I don't know how much or how significant it is.
As far as I know there is no data on it (it's difficult to measure the consumption inside the SoC). Anyway, the NEON unit is a big part of the CPU, so you can conclude it consumes a non-negligible amount of energy.
Guillaume
On Wed, Sep 15, 2010, Michael Hope wrote:
GLIBC has a mechanism for picking the best routines to use based on the CPU capabilities. This means that GLIBC can include A8 and A9 versions both with and without NEON, Ubuntu can ship all of these versions, and the dynamic linker can choose the best one based on the chip it is running on.
Are you sure we have an auxv entry for a8 versus a9? In any case, I doubt it's considered for glibc hwcaps right now as this requires explicit flagging and the list of ARM flags is quite short.
I don't think this would scale very well to multiple CPUs; it's not really a CPU feature we're after here, but a CPU characteristic.
GCC has a bunch of costing mechanisms at compile time, and I think this shows that we need some at runtime, probably in all libcs. One way to work around this in the very short term would be to add some glibc config to turn usage of NEON on or off, or have glibc read cpuinfo or something to identify the CPU model and manufacturer.
Hi,
On Thu, Sep 16, 2010 at 11:43 AM, Loïc Minier loic.minier@linaro.org wrote:
[...]
Are you sure we have an auxv entry for a8 versus a9? In any case, I doubt it's considered for glibc hwcaps right now as this requires explicit flagging and the list of ARM flags is quite short.
Currently, I believe this information can't be retrieved via auxv. It is available-ish in /proc/cpuinfo.
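For example, a rough sketch of pulling the model out of /proc/cpuinfo, assuming the usual ARM "CPU part" values of 0xc08 for the Cortex-A8 and 0xc09 for the Cortex-A9:

/* Sketch: distinguish Cortex-A8 from Cortex-A9 via /proc/cpuinfo.
 * Parses the "CPU part" field; 0xc08 and 0xc09 are the usual part
 * numbers for the A8 and A9 (treat them as assumptions here).
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    unsigned part = 0;

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "CPU part : 0x%x", &part) == 1)
            break;
    }
    fclose(f);

    if (part == 0xc08)
        puts("Cortex-A8: NEON string routines are likely a win");
    else if (part == 0xc09)
        puts("Cortex-A9: prefer the non-NEON routines");
    else
        puts("unknown part: fall back to generic code");
    return 0;
}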
I don't think this would scale very well to multiple CPUs; it's not really a CPU feature we're after here, but a CPU characteristic.
GCC has a bunch of costing mechanisms at compile time, and I think this shows that we need some at runtime, probably in all libcs. One way to work around this in the very short term would be to add some glibc config to turn usage of NEON on or off, or have glibc read cpuinfo or something to identify the CPU model and manufacturer.
Indeed-- a generally applicable approach would be for libc to provide a few implementations, and for benchmarks to be run to choose which implementation to use. We could have preset defaults for some "common" CPU IDs such as Cortex-A[89] if we want, for when no benchmark result is available.
If we go down that road, note that the benchmark can be run quite offline, so it doesn't need to have any startup time impact for applications. But we'd need to put some thought into exactly how this is handled so that we don't end up with stale/wrong configurations.
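Sketching just the dispatch side (every name here is hypothetical; nothing like this exists in any libc today): the chosen implementation is a function pointer set once at startup, preferring a saved benchmark result over a built-in per-CPU default. An environment variable stands in for the saved result:

/* Hypothetical dispatcher sketch; the variant functions are trivial
 * stand-ins for real NEON and Thumb-2 implementations.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void *memcpy_neon(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }
static void *memcpy_thumb2(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

typedef void *(*memcpy_fn)(void *, const void *, size_t);

static memcpy_fn chosen_memcpy = memcpy;    /* safe default */

/* Pick an implementation once, preferring the saved benchmark result
 * (the MEMCPY_IMPL variable is a made-up knob for this sketch).
 */
static void choose_memcpy(void)
{
    const char *pref = getenv("MEMCPY_IMPL");

    if (pref && strcmp(pref, "neon") == 0)
        chosen_memcpy = memcpy_neon;
    else if (pref && strcmp(pref, "thumb2") == 0)
        chosen_memcpy = memcpy_thumb2;
    /* else: could fall back to a default keyed on the CPU part, as above */
}

void *dispatched_memcpy(void *dst, const void *src, size_t n)
{
    return chosen_memcpy(dst, src, n);
}

int main(void)
{
    char a[16] = "hello, memcpy", b[16];

    choose_memcpy();
    dispatched_memcpy(b, a, sizeof(a));
    puts(b);
    return 0;
}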
One note of caution here: it's hard to take account of power when running a benchmark on target devices. So if benchmark speed is our only criterion we might make bad choices sometimes.
It would be kinda interesting if we could switch the memcpy implementation dynamically as processes execute (depending on the level of CPU load and whether we're trying to save power). This is perhaps a bit fanciful though--- it would be somewhat complex to set up and would require some extra support from libc. I don't know how beneficial it would be.
Cheers ---Dave
On Wed, Sep 15, 2010, Michael Hope wrote:
GLIBC has a mechanism for picking the best routines to use based on the CPU capabilities. This means that GLIBC can include A8 and A9 versions both with and without NEON, Ubuntu can ship all of these versions, and the dynamic linker can choose the best one based on the chip it is running on.
Actually I understand STT_GNU_IFUNC would allow that; we just lack a good test.
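For reference, on targets where the toolchain and dynamic linker already support STT_GNU_IFUNC (the ARM ABI details were still pending at the time), the dispatch looks roughly like the sketch below; the two variants are trivial stand-ins rather than real NEON or Thumb-2 code:

/* Sketch of STT_GNU_IFUNC dispatch via the GCC "ifunc" attribute. */
#include <stddef.h>
#include <string.h>

static void *memcpy_neon(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);    /* stand-in for a NEON implementation */
}

static void *memcpy_generic(void *d, const void *s, size_t n)
{
    return memcpy(d, s, n);    /* stand-in for a plain ARM/Thumb-2 one */
}

/* The resolver runs at relocation time and returns the function the
 * symbol should bind to. Real resolvers must be careful about what
 * they call, since most of libc may not be relocated yet.
 */
static void *(*resolve_memcpy(void))(void *, const void *, size_t)
{
    int have_neon = 0;    /* would come from the hwcaps in practice */
    return have_neon ? memcpy_neon : memcpy_generic;
}

void *my_memcpy(void *d, const void *s, size_t n)
    __attribute__((ifunc("resolve_memcpy")));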
On Fri, Sep 17, 2010 at 11:31 PM, Loïc Minier loic.minier@linaro.org wrote:
On Wed, Sep 15, 2010, Michael Hope wrote:
GLIBC has a mechanism for picking the best routines to use based on the CPU capabilities. This means that GLIBC can include A8 and A9 versions both with and without NEON, Ubuntu can ship all of these versions, and the dynamic linker can choose the best one based on the chip it is running on.
Actually I understand STT_GNU_IFUNC would allow that; we just lack a good test.
Is STT_GNU_IFUNC implemented yet?
(I wasn't watching...)
Cheers, ---Dave
On Mon, 2010-09-20 at 09:14 +0100, Dave Martin wrote:
On Fri, Sep 17, 2010 at 11:31 PM, Loïc Minier loic.minier@linaro.org wrote:
On Wed, Sep 15, 2010, Michael Hope wrote:
GLIBC has a mechanism for picking the best routines to use based on the CPU capabilities. This means that GLIBC can include A8 and A9 versions both with and without NEON, Ubuntu can ship all of these versions, and the dynamic linker can choose the best one based on the chip it is running on.
Actually I understand STT_GNU_IFUNC would allow that; we just lack a good test.
Is STT_GNU_IFUNC implemented yet?
No. We need to sort out the ABI specs first.
R.
Hi,
On Mon, Sep 20, 2010 at 4:52 PM, Richard Earnshaw rearnsha@arm.com wrote:
[...]
Actually I understand STT_GNU_IFUNC would allow that; we just lack a good test.
Is STT_GNU_IFUNC implemented yet?
No. We need to sort out the ABI specs first.
R.
Ah, I guess we need that first.
We could invent another mechanism (or maybe glibc already has one) ... but I guess it would be silly to invent something new if IFUNC will be usable soon.
---Dave