All,
In the Toolchain Working Group, Mans has been examining SPEC 2000 and SPEC 2006 to see which C library (glibc) routines impact performance the most and are worth tuning.
This has turned up two areas we consider worthy of further investigation: 1) malloc performance, and 2) floating-point rounding functions.
This email is concerned with the first of these.
Analysis of malloc shows that a large amount of time is spent executing synchronization primitives, even when the program under test is single-threaded.
The obvious 'fix' is to remove the synchronization primitives, which would give a performance boost. This is, of course, not safe, and would require reworking malloc's algorithms to be (substantially) synchronization-free.
A quick Google suggests that there are better performing algorithms available (TCMalloc, Lockless, Hoard, &c), and so changing glibc's algorithm is something well worth investigating.
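For anyone who wants to reproduce the shape of the problem locally, a trivial single-threaded churn loop is enough to see it under perf. This is only an illustrative sketch (file name and constants are made up, not part of any benchmark suite):

/* malloc-churn.c -- illustration only.  Build with "gcc -O2 malloc-churn.c"
 * and run under "perf record" to see how much time lands in malloc/free
 * (and, inside glibc, in the locking paths).
 */
#include <stdio.h>
#include <stdlib.h>

#define SLOTS 1024
#define ITERS 10000000L

int main(void)
{
    void *slot[SLOTS] = { 0 };
    unsigned seed = 1;

    for (long i = 0; i < ITERS; i++) {
        /* Cheap pseudo-random slot and size, to mimic mixed lifetimes. */
        seed = seed * 1103515245u + 12345u;
        unsigned idx = (seed >> 16) % SLOTS;
        size_t size = 16 + (seed % 512);

        free(slot[idx]);           /* free(NULL) is fine on the first pass */
        slot[idx] = malloc(size);
        if (!slot[idx])
            return 1;
    }

    for (unsigned i = 0; i < SLOTS; i++)
        free(slot[i]);

    puts("done");
    return 0;
}

Running the same binary with LD_PRELOAD pointing at tcmalloc or jemalloc is a cheap way to compare allocators without rebuilding anything.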
Currently we see around 4.37% of time being spent in libc across the whole of SPEC CPU 2006. Around 75% of that is in malloc-related functions (so about 3.1% of the total). One benchmark, however, spends around 20% of its time in malloc. So overall we are looking at maybe a 1% improvement in the SPEC 2006 score, which is not large given the amount of effort I estimate this will require (as we have to convince the community we have made everyone's life better).
So before we go any further I would like to see what the view of LEG is about a better malloc. My questions boil down to:
* Is malloc important - or do server applications just implement their own?
* Do you have any benchmarks that stress malloc and would provide us with some more data points?
But any and all comments on the subject are welcome.
Thanks,
Matt
On 28 May 2013 12:03, Matthew Gretton-Dann matthew.gretton-dann@linaro.org wrote:
So before we go any further I would like to see what the view of LEG is about a better malloc. My questions boil down to:
- Is malloc important - or do server applications just implement their own?
The Apache web server manages its heap allocations with differently scoped pools, i.e., per-request, per-session, per-thread. This allows it, for instance, to free a whole request pool at once when it has finished serving a request, instead of going through each of the allocations and freeing them one by one. PHP's Zend core has an elaborate memory management layer which does some similar tricks.
That means I don't expect malloc() to be as dominant in a real-world *web* server application.
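For reference, the pool model looks roughly like this at the API level -- a minimal, illustrative sketch assuming libapr-1 is installed, not httpd's actual request handling:

/* apr-pool-sketch.c -- illustration only.  Build with something like
 * "gcc apr-pool-sketch.c $(apr-1-config --cflags --includes --link-ld)".
 */
#include <apr_general.h>
#include <apr_pools.h>
#include <stdio.h>

int main(void)
{
    apr_pool_t *request_pool;

    apr_initialize();
    apr_pool_create(&request_pool, NULL);   /* one pool per "request" */

    /* Allocations come out of the pool; no individual free() calls. */
    char *headers = apr_palloc(request_pool, 8192);
    char *body    = apr_palloc(request_pool, 65536);
    (void)headers;
    (void)body;

    printf("serving request...\n");

    /* When the request is done, the whole pool goes at once. */
    apr_pool_destroy(request_pool);
    apr_terminate();
    return 0;
}

(The pools still get their underlying blocks from an allocator that ultimately calls malloc(), just far less often than once per allocation.)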
- Do you have any benchmarks that stress malloc and would provide us with some more data points?
Not right now, but if we notice it anywhere near the top in the perf trace, we will let you know about it.
Matthew Gretton-Dann matthew.gretton-dann@linaro.org writes:
All,
[snip]
So before we go any further I would like to see what the view of LEG is about a better malloc. My questions boil down to:
- Is malloc important - or do server applications just implement their own?
I got sent this question and a list of "server applications" and did some investigation, both of typical runtimes and of the applications. Just based on source inspection and a little googling in some cases. Let me know if you want me to look into anything in more detail.
The answer is likely "sometimes" to both parts of the question :-)
These are my notes. Corrections welcome!
Runtimes
========

Perl
----
Uses glibc malloc (in practice -- it ships with its own malloc implementation but this is not used by default on Linux or in the Ubuntu builds)
Python
------
Uses its own allocator for small allocations, which are by far the commonest. Uses glibc malloc for some things (e.g. memory backing a list object), but malloc-related functions do not appear high up in perf traces.
Java
----
Very much does its own heap management.
PHP
---
As Ard says it has its own thing, and looking at its source, it clearly does something quite complicated (zend_alloc.c is nearly 3000 lines). It bundles various libraries (sqlite, pcre, ...) that do call malloc() and it doesn't seem like it tries to get those libraries to call its own implementation of malloc or anything like that -- so some workloads might benefit from malloc improvements.
Server processes
================

apache2
-------
As Ard says it has its own thing where it manages a pool per request. Looks like it calls malloc a fair bit though.
cassandra
---------
Uses the Java heap mostly, clearly. It does store a few things "off heap" (row cache, bloom filter bitsets, compression metadata) via sun.misc.Unsafe.allocateMemory, which /probably/ backs onto glibc malloc, but mostly I think these things are allocated once at process start-up rather than on any hot path.
hadoop
------
Appears to have bits that call malloc. Hard to say more than that without inhaling the architecture more thoroughly.
ceph
----
Certainly calls malloc (and operator new) in many places. So potentially interesting.
memcached
---------
AFAICT, allocates one big chunk of memory with malloc and then does its own thing to divvy it up.
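Roughly this shape -- a small sketch of the idea only, with made-up names and sizes, not memcached's actual slab code:

/* slab-sketch.c -- carve fixed-size chunks out of one big allocation.
 * Illustration only.
 */
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE  64
#define POOL_BYTES  (1024 * 1024)

static char *pool;          /* the one big malloc()ed region */
static void *free_list;     /* singly-linked list threaded through free chunks */

static int slab_init(void)
{
    pool = malloc(POOL_BYTES);
    if (!pool)
        return -1;

    /* Thread every chunk onto the free list. */
    for (size_t off = 0; off + CHUNK_SIZE <= POOL_BYTES; off += CHUNK_SIZE) {
        void **chunk = (void **)(pool + off);
        *chunk = free_list;
        free_list = chunk;
    }
    return 0;
}

static void *slab_alloc(void)
{
    void **chunk = free_list;
    if (!chunk)
        return NULL;        /* pool exhausted; memcached would evict instead */
    free_list = *chunk;
    return chunk;
}

static void slab_free(void *p)
{
    void **chunk = p;
    *chunk = free_list;
    free_list = chunk;
}

int main(void)
{
    if (slab_init())
        return 1;
    char *item = slab_alloc();
    memset(item, 0, CHUNK_SIZE);
    slab_free(item);
    free(pool);
    return 0;
}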
mongodb
-------
AIUI, pushes the problem to the kernel by mmap()ing the data files into its address space and fooling around in there. So probably not dependent on malloc() performance.
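i.e. something like this at the system-call level (an illustrative sketch of the mmap approach, nothing MongoDB-specific):

/* mmap-sketch.c -- map a data file into the address space and work on it
 * in place.  Illustration only; error handling kept minimal.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <datafile>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        return 1;
    }

    /* The kernel pages data in and out; no malloc() on this path. */
    char *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    data[0] = 'x';                    /* mutate the file in place */
    msync(data, st.st_size, MS_SYNC); /* flush dirty pages back */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}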
swift
-----
Seems to be pure Python, so not really dependent on malloc.
varnish
-------
Calls malloc() once per request and does its own allocation within that -- and on Linux (incl. Ubuntu armhf), it uses a bundled version of jemalloc even for that.
haproxy
-------
I *think* this mostly uses a similar model to apache2/varnish: allocate a region once per request (there are quite a few other calls to malloc too -- I don't know whether they are on hot paths or not). It does just use glibc malloc to allocate this memory, though, AFAICT.
tomcat7
-------
Just uses the Java heap afaict (I guess the contained JSPs can use JNI or whatever but it looks like the container doesn't).
- Do you have any benchmarks that stress malloc and would provide us with some more data points?
But any and all comments on the subject are welcome.
It seems Perl and ceph almost certainly have a dependency on glibc malloc performance. In most other cases, it seems that projects which have noticed that malloc can be a little slow have implemented their own solutions. It might be that an improved system malloc would mean that some of these could stop using their own implementations, but oftentimes they are exploiting properties a system malloc simply cannot offer (e.g. allocating an arena per request and then throwing it all away in one big go).
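To make that last point concrete, the per-request pattern boils down to something like the sketch below (illustrative only, names made up, not code from any of the projects above): one malloc() per request, bump-pointer allocation inside it, and a single free() at the end -- a lifetime guarantee a general-purpose malloc cannot assume.

/* arena-sketch.c -- one malloc() per "request", bump-pointer allocation
 * inside it, one free() at the end.  Illustration only.
 */
#include <stdlib.h>
#include <string.h>

struct arena {
    char  *base;
    size_t used;
    size_t size;
};

static int arena_init(struct arena *a, size_t size)
{
    a->base = malloc(size);
    a->used = 0;
    a->size = size;
    return a->base ? 0 : -1;
}

/* Hand out memory by bumping a pointer; individual frees are impossible. */
static void *arena_alloc(struct arena *a, size_t n)
{
    n = (n + 15) & ~(size_t)15;   /* keep 16-byte alignment */
    if (a->used + n > a->size)
        return NULL;              /* a real server would grow or fail the request */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

/* End of request: the whole arena goes in one free(). */
static void arena_release(struct arena *a)
{
    free(a->base);
    a->base = NULL;
}

int main(void)
{
    struct arena req;
    if (arena_init(&req, 64 * 1024))
        return 1;

    char *headers = arena_alloc(&req, 4096);
    char *body    = arena_alloc(&req, 16384);
    if (!headers || !body)
        return 1;
    memset(headers, 0, 4096);
    memset(body, 0, 16384);

    arena_release(&req);
    return 0;
}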
Cheers, mwh