Hi,
On Tue, Dec 7, 2010 at 1:02 AM, Mark Mitchell mark@codesourcery.com wrote:
On 12/6/2010 5:07 AM, Dave Martin wrote:
But, to enable binary distribution, having to have N copies of a library (let alone an application) for N different ARM core variants just doesn't make sense to me.
Just so; as discussed before, improvements to package managers could help here by avoiding the installation of duplicate libraries. (I believe rpm may have some capability in this area, but deb does not at present.)
Yes, a smarter package manager could help a device builder automatically get the right version of a library. But, something more fundamental has to happen to avoid the library developer having to *produce* N versions of a library. (Yes, in theory, you just type "make" with different CFLAGS options, but in practice of course it's often more complex than that, especially if you need to validate the library.)
Yes -- though I didn't elaborate on it. You need a packager that understands, say, that a binary built for the ARMv5 EABI can interoperate with ARMv7 binaries, and so on. Again, I've heard it suggested that RPM can handle this, but I haven't looked into it in detail myself.
Currently, I don't have many examples -- the main one relates to the discussions around using NEON for memcpy(). This can be a performance win on some platforms, but unless the system is heavily loaded, or NEON happens to be turned on anyway, it may not be advantageous for the user or for overall system performance.
How good of a proxy would the length of the copy be, do you think? If you want to copy 1G of data, and NEON makes you 2x-4x faster, then it seems to me that you probably want to use NEON, almost independent of overall system load. But, if you're only going to copy 16 bytes, even if NEON is faster, it's probably OK not to use it -- the function-call overhead to get into memcpy at all is probably significant relative to the time you'd save by using NEON. In between, it's harder, of course -- but perhaps if memcpy is the key example, we could get 80% of the benefit of your idea simply by a test inside memcpy as to the length of the data to be copied?
For the memcpy() case, the answer is probably yes, though how often memcpy() is called by a given thread also matters.
However, there's still a problem: NEON is not designed for implementing memcpy(), so there's no guarantee that it will always be faster ... it is on some SoCs in some situations, but much less beneficial on others -- the "sweet spots" both for performance and power may differ widely from core to core and from SoC to SoC. So running benchmarks on one or two boards and then hard-coding some thresholds into glibc may not be the right approach. Also, gcc expands memcpy() inline in some cases (though only for small copies?)
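To make the length-test idea concrete, a dispatch along these lines is one possibility (the threshold constant and the copy_wide()/copy_narrow() helpers here are invented for illustration -- on a real ARM build copy_wide() would be the hand-written NEON loop, and the cut-over value would have to come from per-SoC benchmarking rather than a fixed number):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-over point: large enough that the cost of
 * bringing NEON into play is amortised over the copy.  The right
 * value is SoC-specific and would come from benchmarking. */
#define NEON_COPY_THRESHOLD 1024

/* Stand-in for a NEON-accelerated copy; here it just delegates to
 * memmove() so the sketch builds and runs on any target. */
static void *copy_wide(void *dst, const void *src, size_t n)
{
    return memmove(dst, src, n);
}

/* Plain integer-register copy for short lengths. */
static void *copy_narrow(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

void *memcpy_dispatch(void *dst, const void *src, size_t n)
{
    if (n >= NEON_COPY_THRESHOLD)
        return copy_wide(dst, src, n);   /* big copy: worth the NEON cost */
    return copy_narrow(dst, src, n);     /* small copy: skip it */
}
```

The point of the sketch is only the shape of the test, not the numbers.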
The dynamic hwcaps approach doesn't really solve that problem: for adapting to different SoCs, you really want a way to run a benchmark on the target to make your decision (xine-lib chooses an internal memcpy implementation this way for example), or a way to pass some platform metrics to glibc / other affected libraries. Identifying the precise SoC from /proc/cpuinfo isn't always straightforward, but I've seen some code making use of it in similar ways.
Cheers ---Dave