All,
In the Toolchain Working Group, Mans has been examining SPEC 2000 and SPEC 2006 to see which C library (glibc) routines impact performance the most and are worth tuning.
This has turned up two areas we consider worthy of further investigation: 1) malloc performance, and 2) floating-point rounding functions.
This email is concerned with the first of these.
Analysis of malloc shows that a large amount of time is spent executing synchronization primitives, even when the program under test is single-threaded.
The obvious 'fix' is to remove the synchronization primitives, which would give a performance boost. This is, of course, not safe, and would require reworking malloc's algorithms to be (substantially) synchronization-free.
A quick Google suggests that there are better performing algorithms available (TCMalloc, Lockless, Hoard, &c), and so changing glibc's algorithm is something well worth investigating.
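For anyone who wants to reproduce the shape of the problem locally, a trivial single-threaded churn loop is enough to see it under perf. This is only an illustrative sketch (file name and constants are made up, not part of any benchmark suite):

/* malloc-churn.c -- illustration only.  Build with "gcc -O2 malloc-churn.c"
 * and run under "perf record" to see how much time lands in malloc/free
 * (and, inside glibc, in the locking paths).
 */
#include <stdio.h>
#include <stdlib.h>

#define SLOTS 1024
#define ITERS 10000000L

int main(void)
{
    void *slot[SLOTS] = { 0 };
    unsigned seed = 1;

    for (long i = 0; i < ITERS; i++) {
        /* Cheap pseudo-random slot and size, to mimic mixed lifetimes. */
        seed = seed * 1103515245u + 12345u;
        unsigned idx = (seed >> 16) % SLOTS;
        size_t size = 16 + (seed % 512);

        free(slot[idx]);           /* free(NULL) is fine on the first pass */
        slot[idx] = malloc(size);
        if (!slot[idx])
            return 1;
    }

    for (unsigned i = 0; i < SLOTS; i++)
        free(slot[i]);

    puts("done");
    return 0;
}

Running the same binary with LD_PRELOAD pointing at tcmalloc or jemalloc is a cheap way to compare allocators without rebuilding anything.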
Currently we see around 4.37% of time being spent in libc across the whole of SPEC CPU 2006. Around 75% of that is in malloc-related functions (so about 3.1% of the total). One benchmark, however, spends around 20% of its time in malloc. So overall we are looking at maybe a 1% improvement in the SPEC 2006 score, which is not large given the amount of effort I estimate this will require (as we have to convince the community we have made everyone's life better).
So before we go any further I would like to see what the view of LEG is about a better malloc. My questions boil down to:
* Is malloc important - or do server applications just implement their own?
* Do you have any benchmarks that stress malloc and would provide us with some more data points?
But any and all comments on the subject are welcome.
Thanks,
Matt
On 28 May 2013 12:03, Matthew Gretton-Dann matthew.gretton-dann@linaro.org wrote:
So before we go any further I would like to see what the view of LEG is about a better malloc. My questions boil down to:
- Is malloc important - or do server applications just implement their own?
The Apache web server manages its heap allocations with differently scoped pools, i.e., per-request, per-session, per-thread. This allows it, for instance, to free a whole request pool at once when it has finished serving a request, instead of going through each of the allocations and freeing them one by one. PHP's Zend core has an elaborate memory management layer which does some similar tricks.
That means I don't expect malloc() to be as dominant in a real-world *web* server application.
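For reference, the pool model looks roughly like this at the API level -- a minimal, illustrative sketch assuming libapr-1 is installed, not httpd's actual request handling:

/* apr-pool-sketch.c -- illustration only.  Build with something like
 * "gcc apr-pool-sketch.c $(apr-1-config --cflags --includes --link-ld)".
 */
#include <apr_general.h>
#include <apr_pools.h>
#include <stdio.h>

int main(void)
{
    apr_pool_t *request_pool;

    apr_initialize();
    apr_pool_create(&request_pool, NULL);   /* one pool per "request" */

    /* Allocations come out of the pool; no individual free() calls. */
    char *headers = apr_palloc(request_pool, 8192);
    char *body    = apr_palloc(request_pool, 65536);
    (void)headers;
    (void)body;

    printf("serving request...\n");

    /* When the request is done, the whole pool goes at once. */
    apr_pool_destroy(request_pool);
    apr_terminate();
    return 0;
}

(The pools still get their underlying blocks from an allocator that ultimately calls malloc(), just far less often than once per allocation.)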
- Do you have any benchmarks that stress malloc and would provide us with some more data points?
Not right now, but if we notice it anywhere near the top in the perf trace, we will let you know about it.
Matthew Gretton-Dann matthew.gretton-dann@linaro.org writes:
All,
[snip]
So before we go any further I would like to see what the view of LEG is about a better malloc. My questions boil down to:
- Is malloc important - or do server applications just implement their own?
I got sent this question and a list of "server applications" and did some investigation, both of typical runtimes and of the applications. Just based on source inspection and a little googling in some cases. Let me know if you want me to look into anything in more detail.
The answer is likely "sometimes" to both parts of the question :-)
These are my notes. Corrections welcome!
Runtimes
========

Perl
----
Uses glibc malloc (in practice -- it ships with its own malloc implementation but this is not used by default on Linux or in the Ubuntu builds)
Python
------
Uses its own allocator for small allocations, which are by far the commonest. Uses glibc malloc for some things (e.g. memory backing a list object), but malloc-related functions do not appear high up in perf traces.
Java
----
Very much does its own heap management.
PHP
---
As Ard says it has its own thing, and looking at its source, it clearly does something quite complicated (zend_alloc.c is nearly 3000 lines). It bundles various libraries (sqlite, pcre, ...) that do call malloc() and it doesn't seem like it tries to get those libraries to call its own implementation of malloc or anything like that -- so some workloads might benefit from malloc improvements.
Server processes
================

apache2
-------
As Ard says it has its own thing where it manages a pool per request. Looks like it calls malloc a fair bit though.
cassandra
---------
Uses the Java heap mostly, clearly. It does store a few things "off heap" (row cache, bloom filter bitsets, compression metadata) via sun.misc.Unsafe.allocateMemory, which /probably/ backs onto glibc malloc, but mostly I think these things are allocated once at process start-up rather than on any hot path.
hadoop
------
Appears to have bits that call malloc. Hard to say more than that without inhaling the architecture more thoroughly.
ceph
----
Certainly calls malloc (and operator new) in many places. So potentially interesting.
memcached
---------
AFAICT, allocates one big chunk of memory with malloc and then does its own thing to divvy it up.
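Roughly this shape -- a small sketch of the idea only, with made-up names and sizes, not memcached's actual slab code:

/* slab-sketch.c -- carve fixed-size chunks out of one big allocation.
 * Illustration only.
 */
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE  64
#define POOL_BYTES  (1024 * 1024)

static char *pool;          /* the one big malloc()ed region */
static void *free_list;     /* singly-linked list threaded through free chunks */

static int slab_init(void)
{
    pool = malloc(POOL_BYTES);
    if (!pool)
        return -1;

    /* Thread every chunk onto the free list. */
    for (size_t off = 0; off + CHUNK_SIZE <= POOL_BYTES; off += CHUNK_SIZE) {
        void **chunk = (void **)(pool + off);
        *chunk = free_list;
        free_list = chunk;
    }
    return 0;
}

static void *slab_alloc(void)
{
    void **chunk = free_list;
    if (!chunk)
        return NULL;        /* pool exhausted; memcached would evict instead */
    free_list = *chunk;
    return chunk;
}

static void slab_free(void *p)
{
    void **chunk = p;
    *chunk = free_list;
    free_list = chunk;
}

int main(void)
{
    if (slab_init())
        return 1;
    char *item = slab_alloc();
    memset(item, 0, CHUNK_SIZE);
    slab_free(item);
    free(pool);
    return 0;
}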
mongodb
-------
AIUI, pushes the problem to the kernel by mmap()ing the data files into its address space and fooling around in there. So probably not dependent on malloc() performance.
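i.e. something like this at the system-call level (an illustrative sketch of the mmap approach, nothing MongoDB-specific):

/* mmap-sketch.c -- map a data file into the address space and work on it
 * in place.  Illustration only; error handling kept minimal.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <datafile>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        return 1;
    }

    /* The kernel pages data in and out; no malloc() on this path. */
    char *data = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    data[0] = 'x';                    /* mutate the file in place */
    msync(data, st.st_size, MS_SYNC); /* flush dirty pages back */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}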
swift
-----
Seems to be pure Python, so not really dependent on malloc.
varnish
-------
Calls malloc() once per request and does its own allocation within that -- and on Linux (incl. Ubuntu armhf), it uses a bundled version of jemalloc even for that.
haproxy
-------
I *think* this mostly uses a similar model to apache2/varnish: allocate a region once per request (there are quite a few other calls to malloc too -- I don't know whether they are on hot paths or not). It does just use glibc malloc to allocate this memory, though, AFAICT.
tomcat7
-------
Just uses the Java heap afaict (I guess the contained JSPs can use JNI or whatever but it looks like the container doesn't).
- Do you have any benchmarks that stress malloc and would provide us with some more data points?
But any and all comments on the subject are welcome.
It seems Perl and ceph almost certainly have a dependency on glibc malloc performance. In most other cases, it seems that projects which have noticed that malloc can be a little slow have implemented their own solutions. It might be that an improved system malloc would mean that some of these could stop using their own implementations, but oftentimes they are exploiting properties a system malloc simply cannot offer (e.g. allocating an arena per request and then throwing it all away in one big go).
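To make that last point concrete, the per-request pattern boils down to something like the sketch below (illustrative only, names made up, not code from any of the projects above): one malloc() per request, bump-pointer allocation inside it, and a single free() at the end -- a lifetime guarantee a general-purpose malloc cannot assume.

/* arena-sketch.c -- one malloc() per "request", bump-pointer allocation
 * inside it, one free() at the end.  Illustration only.
 */
#include <stdlib.h>
#include <string.h>

struct arena {
    char  *base;
    size_t used;
    size_t size;
};

static int arena_init(struct arena *a, size_t size)
{
    a->base = malloc(size);
    a->used = 0;
    a->size = size;
    return a->base ? 0 : -1;
}

/* Hand out memory by bumping a pointer; individual frees are impossible. */
static void *arena_alloc(struct arena *a, size_t n)
{
    n = (n + 15) & ~(size_t)15;   /* keep 16-byte alignment */
    if (a->used + n > a->size)
        return NULL;              /* a real server would grow or fail the request */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

/* End of request: the whole arena goes in one free(). */
static void arena_release(struct arena *a)
{
    free(a->base);
    a->base = NULL;
}

int main(void)
{
    struct arena req;
    if (arena_init(&req, 64 * 1024))
        return 1;

    char *headers = arena_alloc(&req, 4096);
    char *body    = arena_alloc(&req, 16384);
    if (!headers || !body)
        return 1;
    memset(headers, 0, 4096);
    memset(body, 0, 16384);

    arena_release(&req);
    return 0;
}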
Cheers, mwh