Hi!
I've attempted to study the implementation of memcpy for 32-bit Arm cores in Glibc (which is also found in arm-optimized-routines and first appeared in Linaro's cortex-strings project), and I came across a peculiar snippet:
#ifdef USE_VFP
        /* Magic dust alert!  Force VFP on Cortex-A9.  Experiments show
           that the FP pipeline is much better at streaming loads and
           stores.  This is outside the critical loop.  */
        vmov.f32        s0, s0
#endif
This seems to imply that this NOP-like instruction affects CPU state and makes the vldr/vstr instructions that follow use different datapaths than they might otherwise. Can anyone shed more light on this, please?
I was able to trace the history of this code back to revision 100 in the cortex-strings repository, where it appeared as part of a large rewrite by Will Newton:
https://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/revi...
The entire memcpy.S file in the Arm optimized-routines repo can be found here:
https://github.com/ARM-software/optimized-routines/blob/master/string/arm/me...
Thanks!
Alexander
On 27/11/2019 12:03, Alexander Monakov wrote:
> [...]
Unfortunately, this snippet did not raise any questions during patch review [1].
My guess, after consulting the "Cortex-A9 NEON Media Processing Engine" manual [2], is that the instruction is meant to force use of the VFP data-processing unit, since the Cortex-A9 processor (when implemented with the MPE) contains distinct data-processing units for integer operations, Advanced SIMD, and VFP (page 3-19).
However, both vldr and vstr are described in the manual as:

Name   Advanced SIMD   VFP    Description
VLDR   X               F, D   Load Single Register
VSTR   X               F, D   Store Register

This would mean that the vldr/vstr instructions used in the loop below should exercise the Advanced SIMD unit.
I couldn't find any information on the A9 technical manual site [3] about how the data-processing unit is selected, nor about how previous instructions could influence that selection.
[1] https://patches.linaro.org/patch/16133/
[2] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0409g/DDI0409G_cortex_a9...
[3] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.subset.cortexa.a...
linaro-toolchain@lists.linaro.org