On Aug 7, 2020, at 12:21 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski <luto@amacapital.net> wrote:
4 cycles per byte on Core 2
I took the reference C implementation as-is, and just compiled it with -O2, so my numbers may not be what some heavily optimized case does.
But it was way more than that, even when amortizing for "only need to do it every 8 cases". I think the 4 cycles/byte might be some "zero branch mispredicts" case when you've fully unrolled the thing, but then you'll be taking I$ misses out of the wazoo, since by definition this won't be in your L1 I$ at all (only called every 8 times).
Sure, it might look ok on microbenchmarks where it does stay hot in the cache all the time, but that's not realistic. I
No one said we have to do only one ChaCha20 block per slow path hit. In fact, the more we reduce the number of rounds, the more time we spend on I$ misses, branch mispredictions, etc, so reducing rounds may be barking up the wrong tree entirely. We probably don’t want to have more than one page
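To make the "more than one block per slow path hit" idea concrete, here is a rough sketch in plain C. Everything named here (struct batch_rng, batch_rng_u64, BLOCKS_PER_REFILL and its value of 4) is made up for illustration and is not the kernel's code. It also leaves out reseeding, wiping consumed output, and any per-CPU handling, so treat it as a picture of the batching only.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; a sketch, not the kernel implementation. */
#define BLOCKS_PER_REFILL 4   /* assumption: would need tuning against cache cost */
#define BUF_WORDS (16 * BLOCKS_PER_REFILL)

static inline uint32_t rotl32(uint32_t v, int n)
{
    return (v << n) | (v >> (32 - n));
}

#define QR(a, b, c, d) do {                 \
    a += b; d ^= a; d = rotl32(d, 16);      \
    c += d; b ^= c; b = rotl32(b, 12);      \
    a += b; d ^= a; d = rotl32(d, 8);       \
    c += d; b ^= c; b = rotl32(b, 7);       \
} while (0)

/* One ChaCha20 block: 10 double rounds, then add the input state back in. */
static void chacha20_block(const uint32_t in[16], uint32_t out[16])
{
    uint32_t x[16];

    memcpy(x, in, sizeof(x));
    for (int i = 0; i < 10; i++) {
        /* column round */
        QR(x[0], x[4], x[8],  x[12]);
        QR(x[1], x[5], x[9],  x[13]);
        QR(x[2], x[6], x[10], x[14]);
        QR(x[3], x[7], x[11], x[15]);
        /* diagonal round */
        QR(x[0], x[5], x[10], x[15]);
        QR(x[1], x[6], x[11], x[12]);
        QR(x[2], x[7], x[8],  x[13]);
        QR(x[3], x[4], x[9],  x[14]);
    }
    for (int i = 0; i < 16; i++)
        out[i] = x[i] + in[i];
}

struct batch_rng {
    uint32_t state[16];       /* constants, key, block counter, nonce */
    uint32_t buf[BUF_WORDS];  /* output of the last refill */
    unsigned pos;             /* next unread word; BUF_WORDS means empty */
    unsigned long refills;    /* slow path hits, kept only for illustration */
};

static void batch_rng_init(struct batch_rng *r, const uint32_t key[8])
{
    /* "expand 32-byte k" */
    static const uint32_t sigma[4] = {
        0x61707865, 0x3320646e, 0x79622d32, 0x6b206574
    };

    memset(r, 0, sizeof(*r));
    memcpy(&r->state[0], sigma, sizeof(sigma));
    memcpy(&r->state[4], key, 8 * sizeof(uint32_t));
    r->pos = BUF_WORDS;       /* force a refill on first use */
}

/* Slow path: generate several blocks while the code is hot in I$. */
static void batch_rng_refill(struct batch_rng *r)
{
    for (int b = 0; b < BLOCKS_PER_REFILL; b++) {
        chacha20_block(r->state, &r->buf[16 * b]);
        r->state[12]++;       /* 32-bit block counter; wraparound ignored here */
    }
    r->pos = 0;
    r->refills++;
}

/* Fast path: hand out 8 bytes from the buffer, refilling when it runs dry. */
static uint64_t batch_rng_u64(struct batch_rng *r)
{
    uint64_t v;

    if (r->pos + 2 > BUF_WORDS)
        batch_rng_refill(r);
    v = ((uint64_t)r->buf[r->pos + 1] << 32) | r->buf[r->pos];
    r->pos += 2;
    return v;
}
```

With one 64-byte block per refill you hit the slow path every 8 calls for a u64; with the 4 blocks assumed here it is every 32 calls, at the price of a 256-byte buffer per instance. Whether that trade is a win for a cache-cold caller is exactly the thing that would need measuring.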
I wonder if AES-NI adds any value here. AES-CTR is almost a drop-in replacement for ChaCha20, and maybe the performance for a cache-cold short run is better.