The unaligned accesses in libpng are, for the large copies, a bug. Our attempt to align the row buffer to a 16-byte boundary was off by one, so we always end up misaligning it. I've posted a patch on the png-mng-implement list:
http://sourceforge.net/mailarchive/message.php?msg_id=28194444
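To illustrate how an off-by-one of this kind can arise, here is a minimal sketch in C. It is not the libpng code or the posted patch. The helper names `align_up` and `place_row` are hypothetical, and it assumes the offset comes from the one-byte filter type that precedes the pixel data in each PNG row, so it is the byte after the row start that should land on the 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libpng function): round p up to the next
 * multiple of align, which must be a power of two. An address that is
 * already aligned is returned unchanged. */
unsigned char *align_up(unsigned char *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;

    assert(align != 0 && (align & (align - 1)) == 0);
    return p + ((align - (addr & (align - 1))) & (align - 1));
}

/* Hypothetical helper: choose the start of a row inside raw so that the
 * pixel data, which begins one byte after the filter-type byte, is
 * aligned. Aligning the row start itself would leave the pixel data one
 * byte past the boundary. raw needs at least align spare bytes. */
unsigned char *place_row(unsigned char *raw, size_t align)
{
    return align_up(raw + 1, align) - 1;
}
```

With this placement, `row + 1` is the address handed to the large copies, and that address is the one on the boundary.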
The time spent in memcpy() is probably an illusion. The data out of zlib is copied into one row buffer, where it is unfiltered (if necessary); a copy is then made in a separate buffer that is used only for the filter handling. If you test using images with large rows (I don't know what pngbench does), the copy buffer may well get flushed out of the second-level cache between rows, and the memcpy will then stall bringing it back in.
If you have machine-level profiling you may see this as a massive time spike on some probably unrelated instruction that just happens to be in the PC when the stall stops everything.
Anyway, I have several ideas of how to avoid the copy when it isn't required.
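As an illustration only, one generic way to avoid a per-row copy is to keep two row buffers and swap the pointers once a row has been unfiltered. This is a sketch under that assumption, not necessarily one of the ideas meant above and not libpng code. `struct row_state`, `unfilter_up` and `finish_row` are hypothetical names; the Up filter is used as the example because it is one of the PNG filters that reads the previous row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reader state (not libpng's png_struct). */
struct row_state {
    unsigned char *cur;   /* row currently being decoded */
    unsigned char *prev;  /* previous row, already unfiltered */
};

/* Undo the PNG "Up" filter in place: each byte is the filtered byte
 * plus the byte directly above it, modulo 256. */
void unfilter_up(struct row_state *s, size_t n)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < n; i++)
        s->cur[i] = (unsigned char)(s->cur[i] + s->prev[i]);
}

/* Instead of copying cur into prev, exchange the two pointers. The old
 * prev buffer becomes the destination for the next row. */
void finish_row(struct row_state *s)
{
    unsigned char *tmp;

    assert(s != NULL);
    tmp = s->prev;
    s->prev = s->cur;
    s->cur = tmp;
}
```

The swap costs the same regardless of row width, so it never has to pull a second buffer back into the cache.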
John Bowler jbowler@acm.org
-----Original Message-----
From: Glenn Randers-Pehrson [mailto:glennrp@gmail.com]
Sent: Monday, October 03, 2011 1:15 PM
To: PNG/MNG implementation discussion list
Subject: [png-mng-implement] Use of memcpy() in libpng [Fwd from linaro-toolchain list]
Re: Use of memcpy() in libpng
David Gilbert Tue, 27 Sep 2011 06:20:14 -0700
On 27 September 2011 14:16, Christian Robottom Reis k...@linaro.org wrote:
On Tue, Sep 27, 2011 at 09:47:33AM +0100, Ramana Radhakrishnan wrote:
On 26 September 2011 21:51, Michael Hope michael.h...@linaro.org wrote:
Saw this on the linaro-multimedia list: http://lists.linaro.org/pipermail/linaro-multimedia/2011-September/000074.html
libpng spends a significant amount of time in memcpy(). This might tie in with Ramana's investigation or the unaligned access work by allowing more memcpy()s to be inlined.
It's the unaligned access and the change / improvements to the memcpy that *might* help in this case. But that of course depends on the compiler knowing when it can do such a thing. Of course, what might be more interesting is the kind of workload analysis that Dave's done in the past with memcpy, to find out the alignment and size of the buffers being copied.
If you guys could take a look at this there is a potential requirement for the MMWG around libpng optimization; we could fit this in along with other work (possible vectorizing, etc) on that component.
It wouldn't take long to analyse the memcpy calls - life would be easier if we had the test program and some details on things like what sizes of images were used in these benchmarks.
Dave
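
The kind of memcpy analysis discussed above can be sketched as a small tracing wrapper in C. This is a hypothetical illustration, not the tooling used on the Linaro lists. `traced_memcpy`, `size_bucket` and `dump_memcpy_stats` are made-up names, and how the wrapper gets substituted for memcpy in libpng (a macro, a linker wrap, or otherwise) is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SIZE_BUCKETS 32

/* size_hist[0] counts zero-length copies; size_hist[b] for b >= 1
 * counts sizes in [2^(b-1), 2^b). The last bucket also takes anything
 * larger. */
unsigned long size_hist[SIZE_BUCKETS];

/* align_hist[d][s] counts copies with (dst & 15) == d, (src & 15) == s. */
unsigned long align_hist[16][16];

unsigned size_bucket(size_t n)
{
    unsigned b = 0;

    while (n != 0 && b < SIZE_BUCKETS - 1) {
        n >>= 1;
        b++;
    }
    return b;
}

/* Record the size and 16-byte alignment of a copy, then perform it. */
void *traced_memcpy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_hist[size_bucket(n)]++;
    align_hist[(uintptr_t)dst & 15][(uintptr_t)src & 15]++;
    return memcpy(dst, src, n);
}

/* Print every non-empty histogram entry. */
void dump_memcpy_stats(FILE *out)
{
    unsigned b, d, s;

    for (b = 0; b < SIZE_BUCKETS; b++)
        if (size_hist[b] != 0)
            fprintf(out, "size bucket %u: %lu\n", b, size_hist[b]);
    for (d = 0; d < 16; d++)
        for (s = 0; s < 16; s++)
            if (align_hist[d][s] != 0)
                fprintf(out, "dst%%16=%u src%%16=%u: %lu\n",
                        d, s, align_hist[d][s]);
}
```

Calling `dump_memcpy_stats(stderr)` at exit would show whether the time goes to a few large row copies or to many small ones, and whether source and destination share the same alignment.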