== Progress ==
* Connect last week.
* Worked through the open issues and work items related to performance; we now have a clear list of what is currently in flight (https://wiki.linaro.org/RamanaRadhakrishnan/Sandbox//RRQ212ConnectNotes). The next step is to track this better and move it off the wiki page into a form that we can use in our regular performance meetings.
* Created blueprints, closed down old issues and reprioritized issues with Ulrich and others.
* Had a number of interesting conversations during Connect about compiler-related issues.
* Other sessions that I attended included the Android optimization sessions. While there was quite a bit about toolchain performance, it is important that we keep looking at the performance profiles to find areas where the toolchain can be improved. However, this can't be done without getting more testcases from other groups. There were a couple of interesting comments that skia is CPU bound, which would indicate that the paint function is CPU bound. But why and how? Someone should look at reproducing these numbers and see where that gets us. Suggested that cortex-strings might be worth getting into bionic.
* Fixed the vrev off-by-one error and committed the fix to FSF trunk. However, it couldn't make it in time for FSF 4.7.1 as the merge window had closed by then.
* Set up my panda board to be identical to what runs in our validation labs.
* This week
* Worked through the merge requests and moved some patches upstream, away from the "toreview" state.
* Landed a few merge requests that had been approved but not yet merged. Took care of merging the upstream 4.7 branch.
* Given that I only had a few hours back in the office this week, I worked on regenerating arm_neon.h to use __builtin_shuffle for vrev64, vrev32, vtrn, vzip and vuzp. A follow-up patch needs to do the same for vext, but that also needs generic support in vec_perm_const_ok. Once that is done I think we can safely start rewriting. It still needs some more testing and polishing, but the initial results on the testcase from PR48941 are kind of neat. The results for some of the other testcases that I've looked at also look much better than where we were a few weeks back. So all in all nice progress on that front. However, we also have to find a way of getting these generated at O0, which this approach does not appear to do cleanly enough.
For one example, the generated code looks like the listings below. Notice those spills beginning to disappear .... :)
New:
sqrlen4D_16u8:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vabd.u8 q1, q0, q1
        vmull.u8 q0, d2, d2
        vmull.u8 q8, d3, d3
        vuzp.32 q0, q8
        vpaddl.u16 q0, q0
        vpadal.u16 q0, q8
        bx lr

Old:
sqrlen4D_16u8:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 1, uses_anonymous_args = 0
        @ link register save eliminated.
        vabd.u8 q1, q0, q1
        stmfd sp!, {r4, fp}
        add fp, sp, #4
        sub sp, sp, #48
        add r3, sp, #15
        vmull.u8 q0, d2, d2
        bic r3, r3, #15
        vmull.u8 q8, d3, d3
        vuzp.32 q0, q8
        vstmia r3, {d0-d1}
        vstr d16, [r3, #16]
        vstr d17, [r3, #24]
        vpaddl.u16 q0, q0
        vpadal.u16 q0, q8
        sub sp, fp, #4
        ldmfd sp!, {r4, fp}
        bx lr
* Attended platform / WG sync-up.
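For readers less familiar with the approach, here is a rough sketch of what expressing the NEON permutes through __builtin_shuffle looks like. It is not the regenerated arm_neon.h: the typedefs and function names (u32x4, i32x4, rev64_u32 and so on) are hypothetical stand-ins, and it only assumes the generic GCC vector extension (GCC 4.7 or later), so it builds on any GCC target. Mask indices follow the builtin's convention: 0..3 pick lanes of the first operand, 4..7 lanes of the second.

```c
#include <stdint.h>

/* Illustrative generic vector types; stand-ins for uint32x4_t and a
   same-sized integer mask type. These names are NOT from arm_neon.h. */
typedef uint32_t u32x4 __attribute__((vector_size(16)));
typedef int32_t  i32x4 __attribute__((vector_size(16)));

/* vrev64.32 pattern: swap the two 32-bit lanes inside each 64-bit half. */
static inline u32x4 rev64_u32(u32x4 a)
{
    const i32x4 mask = {1, 0, 3, 2};
    return __builtin_shuffle(a, mask);
}

/* vuzp.32 pattern: de-interleave, even lanes then odd lanes. */
static inline u32x4 uzp_even_u32(u32x4 a, u32x4 b)
{
    const i32x4 mask = {0, 2, 4, 6};
    return __builtin_shuffle(a, b, mask);
}

static inline u32x4 uzp_odd_u32(u32x4 a, u32x4 b)
{
    const i32x4 mask = {1, 3, 5, 7};
    return __builtin_shuffle(a, b, mask);
}

/* vzip.32 pattern: interleave the low halves, then the high halves. */
static inline u32x4 zip_lo_u32(u32x4 a, u32x4 b)
{
    const i32x4 mask = {0, 4, 1, 5};
    return __builtin_shuffle(a, b, mask);
}

static inline u32x4 zip_hi_u32(u32x4 a, u32x4 b)
{
    const i32x4 mask = {2, 6, 3, 7};
    return __builtin_shuffle(a, b, mask);
}

/* vtrn.32 pattern: transpose 2x2 blocks of lanes across a and b. */
static inline u32x4 trn_even_u32(u32x4 a, u32x4 b)
{
    const i32x4 mask = {0, 4, 2, 6};
    return __builtin_shuffle(a, b, mask);
}

static inline u32x4 trn_odd_u32(u32x4 a, u32x4 b)
{
    const i32x4 mask = {1, 5, 3, 7};
    return __builtin_shuffle(a, b, mask);
}
```

Because the masks are compile-time constants, the middle end sees the permutation itself and not an opaque builtin call, which is presumably what lets the spills go away in the listing above. Whether a given constant mask maps to a single instruction still depends on the backend recognising it, which is why vext is waiting on the vec_perm_const_ok support.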
== Plans ==
* Clean up the ml bits of rewiring the intrinsics and try some proper testcases.
* Work on the auto-inc-dec scheduler patches.
* Rework the sched-pressure patch upstream.
* Review the Android benchmarking writeups.