On Fri, Aug 7, 2020 at 10:55 AM Andy Lutomirski luto@amacapital.net wrote:
> I think the real random.c can run plenty fast. It’s ChaCha20 plus ludicrous overhead right now.
I doubt it.
I tried something very much like that in user space to just see how many cycles it ended up being.
I made a "just raw ChaCha20", and it was already much too slow for what some of the networking people claim to want.
And maybe they are asking for too much, but if they think it's too slow, they'll not use it, and then we're back to square one.
Now, what *might* be acceptable is to not do ChaCha20, but simply do a single double-round of it.
So after doing 10 prandom_u32() calls, you'd have done a full ChaCha20. I didn't actually try that, but from looking at the costs from trying the full thing, I think it might be in the right ballpark.
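For concreteness, a rough userspace sketch of that idea (hypothetical and untested; the seeding and the names are made up purely for illustration):

/* Sketch: one ChaCha double-round per prandom_u32() call. */
#include <stdint.h>
#include <stdio.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))
#define QR(a, b, c, d) do {                    \
        a += b; d ^= a; d = ROTL32(d, 16);     \
        c += d; b ^= c; b = ROTL32(b, 12);     \
        a += b; d ^= a; d = ROTL32(d, 8);      \
        c += d; b ^= c; b = ROTL32(b, 7);      \
    } while (0)

/* Per-cpu in the real thing; a single global is enough for the sketch. */
static uint32_t state[16];      /* would be seeded from the real RNG */
static unsigned int calls;

static uint32_t prandom_u32_sketch(void)
{
    /* The four "column" quarter-rounds... */
    QR(state[0], state[4], state[8],  state[12]);
    QR(state[1], state[5], state[9],  state[13]);
    QR(state[2], state[6], state[10], state[14]);
    QR(state[3], state[7], state[11], state[15]);
    /* ...and the four "diagonal" ones: together one double-round. */
    QR(state[0], state[5], state[10], state[15]);
    QR(state[1], state[6], state[11], state[12]);
    QR(state[2], state[7], state[8],  state[13]);
    QR(state[3], state[4], state[9],  state[14]);

    /* Hand out one word; 10 calls amortize to a full 20 rounds. */
    return state[calls++ & 15];
}

int main(void)
{
    for (int i = 0; i < 16; i++)
        state[i] = 0x9e3779b9u * (i + 1);       /* stand-in seed only */
    for (int i = 0; i < 8; i++)
        printf("%08x\n", prandom_u32_sketch());
    return 0;
}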
How does that sound to people?
Linus
On Aug 7, 2020, at 11:10 AM, Linus Torvalds torvalds@linux-foundation.org wrote:
> On Fri, Aug 7, 2020 at 10:55 AM Andy Lutomirski luto@amacapital.net wrote:
> > I think the real random.c can run plenty fast. It’s ChaCha20 plus ludicrous overhead right now.
> I doubt it.
> I tried something very much like that in user space to just see how many cycles it ended up being.
> I made a "just raw ChaCha20", and it was already much too slow for what some of the networking people claim to want.
Do you remember the numbers?
Certainly a full ChaCha20 per random number is too much, but AFAICT the network folks want 16 or 32 bits at a time, which is 1/32 or 1/16 of a ChaCha20 block. DJB claims 4 cycles per byte on Core 2, and it had better be faster now, although we can’t usefully use XMM regs, so I don’t know the real timings.
But with the current code, the actual crypto will be lost in the noise. That’s what I’m trying to fix.
> Now, what *might* be acceptable is to not do ChaCha20, but simply do a single double-round of it.
We can certainly have a parallel RNG seeded by the main RNG that runs fewer rounds. I’ll do that if benchmarks say I’m still too slow.
All of this is trivial except the locking. If I’m writing this code, I personally refuse to use the “races just make it more random” strategy. I’m going to do it without data races, and this will take a bit of work.
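As a rough illustration of that structure, here is a userspace model (the batch size is arbitrary, getrandom() stands in for reseeding from the main RNG, and the kernel locking is only described in comments; none of this is the eventual implementation):

/*
 * Sketch of a buffered per-cpu generator that is periodically reseeded
 * from the main RNG.  getrandom(2) is a userspace stand-in for "ask
 * random.c for fresh bytes"; in the kernel the refill would instead run
 * a (possibly reduced-round) ChaCha keyed from the main RNG.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/random.h>

#define BATCH_WORDS 32          /* two 64-byte blocks, i.e. two cachelines */

struct rnd_batch {
    uint32_t buf[BATCH_WORDS];
    unsigned int pos;           /* next unused word */
};

/* One of these per CPU in the kernel; a single instance here. */
static struct rnd_batch batch = { .pos = BATCH_WORDS };

static uint32_t get_random_u32_sketch(void)
{
    /*
     * In the kernel this function would run with preemption (or
     * interrupts) disabled so the per-cpu batch is never touched
     * concurrently; no "races make it more random" tricks.
     */
    if (batch.pos >= BATCH_WORDS) {
        /* Slow path: refill the whole batch in one go. */
        if (getrandom(batch.buf, sizeof(batch.buf), 0) !=
            (ssize_t)sizeof(batch.buf))
            memset(batch.buf, 0, sizeof(batch.buf));  /* fail closed */
        batch.pos = 0;
    }

    return batch.buf[batch.pos++];
}

int main(void)
{
    for (int i = 0; i < 4; i++)
        printf("%08x\n", get_random_u32_sketch());
    return 0;
}

The point of the shape is that the fast path is an array read and an increment; everything expensive lives in the refill.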
On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski luto@amacapital.net wrote:
> On Aug 7, 2020, at 11:10 AM, Linus Torvalds torvalds@linux-foundation.org wrote:
> > I tried something very much like that in user space to just see how many cycles it ended up being.
> > I made a "just raw ChaCha20", and it was already much too slow for what some of the networking people claim to want.
> Do you remember the numbers?
Sorry, no. I wrote a hacky thing in user space, and threw it away.
> Certainly a full ChaCha20 per random number is too much, but AFAICT the network folks want 16 or 32 bits at a time, which is 1/32 or 1/16 of a ChaCha20 block.
That's what I did (well, I did just the 32-bit one), basically emulating percpu accesses for incrementing the offset (I didn't actually *do* percpu accesses, I just did a single-threaded run and used globals, but wrote it with wrappers so that it would look like it might work).
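For illustration, such a harness might look roughly like this (hypothetical: the this_cpu-style wrappers fake percpu data with globals, and a trivial placeholder sits where the reference ChaCha20 code would go, so the number it prints is about the harness, not about ChaCha20):

/*
 * Rough userspace harness: "per-cpu" accessors faked with globals
 * (single-threaded), and a cycle count around the generator under test.
 */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc(), x86 only */

/* Fake this_cpu-style wrappers: globals standing in for percpu data. */
static uint32_t fake_percpu_state[16];
static unsigned int fake_percpu_pos;

#define this_cpu_state()        (fake_percpu_state)
#define this_cpu_inc_pos()      (fake_percpu_pos++)

/* Placeholder generator: the reference ChaCha20 block would go here. */
static uint32_t generator_under_test(void)
{
    uint32_t *s = this_cpu_state();
    unsigned int i = this_cpu_inc_pos() & 15;

    s[i] = s[i] * 2654435761u + 1;      /* stand-in mixing, not chacha */
    return s[i];
}

int main(void)
{
    const unsigned long loops = 10 * 1000 * 1000;
    uint32_t sink = 0;

    uint64_t start = __rdtsc();
    for (unsigned long i = 0; i < loops; i++)
        sink ^= generator_under_test();
    uint64_t end = __rdtsc();

    /* TSC ticks, not strictly core cycles, but fine for a ballpark. */
    printf("%.1f cycles/call (sink %08x)\n",
           (double)(end - start) / loops, sink);
    return 0;
}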
> DJB claims 4 cycles per byte on Core 2
I took the reference C implementation as-is, and just compiled it with -O2, so my numbers may not be what some heavily optimized case does.
But it was way more than that, even when amortizing for "only need to do it every 8 cases". I think the 4 cycles/byte might be some "zero branch mispredicts" case when you've fully unrolled the thing, but then you'll be taking I$ misses out of the wazoo, since by definition this won't be in your L1 I$ at all (only called every 8 times).
Sure, it might look ok on microbenchmarks where it does stay hot in the cache all the time, but that's not realistic.
Linus
On Aug 7, 2020, at 12:21 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
> On Fri, Aug 7, 2020 at 12:08 PM Andy Lutomirski luto@amacapital.net wrote:
> > 4 cycles per byte on Core 2
> I took the reference C implementation as-is, and just compiled it with -O2, so my numbers may not be what some heavily optimized case does.
> But it was way more than that, even when amortizing for "only need to do it every 8 cases". I think the 4 cycles/byte might be some "zero branch mispredicts" case when you've fully unrolled the thing, but then you'll be taking I$ misses out of the wazoo, since by definition this won't be in your L1 I$ at all (only called every 8 times).
> Sure, it might look ok on microbenchmarks where it does stay hot in the cache all the time, but that's not realistic.
No one said we have to do only one ChaCha20 block per slow path hit. In fact, the more we reduce the number of rounds, the larger the share of the time we spend on I$ misses, branch mispredictions, etc., so reducing rounds may be barking up the wrong tree entirely. We probably don’t want to buffer more than one page at a time, though.
I wonder if AES-NI adds any value here. AES-CTR is almost a drop-in replacement for ChaCha20, and maybe the performance for a cache-cold short run is better.
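For reference, a minimal userspace sketch of AES-128 in counter mode with AES-NI (needs -maes; purely illustrative, not a proposed kernel implementation):

/* Minimal AES-128-CTR keystream with AES-NI (compile with -maes). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <wmmintrin.h>

static __m128i expand_step(__m128i key, __m128i kga)
{
    kga = _mm_shuffle_epi32(kga, _MM_SHUFFLE(3, 3, 3, 3));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, kga);
}
#define EXPAND(rk, rcon) expand_step(rk, _mm_aeskeygenassist_si128(rk, rcon))

static void aes128_key_schedule(const uint8_t key[16], __m128i rk[11])
{
    rk[0]  = _mm_loadu_si128((const __m128i *)key);
    rk[1]  = EXPAND(rk[0], 0x01);
    rk[2]  = EXPAND(rk[1], 0x02);
    rk[3]  = EXPAND(rk[2], 0x04);
    rk[4]  = EXPAND(rk[3], 0x08);
    rk[5]  = EXPAND(rk[4], 0x10);
    rk[6]  = EXPAND(rk[5], 0x20);
    rk[7]  = EXPAND(rk[6], 0x40);
    rk[8]  = EXPAND(rk[7], 0x80);
    rk[9]  = EXPAND(rk[8], 0x1b);
    rk[10] = EXPAND(rk[9], 0x36);
}

/* Encrypt one 16-byte counter block: 16 bytes of keystream per call. */
static __m128i aes128_ctr_block(const __m128i rk[11], uint64_t ctr)
{
    /* Trivial counter block; a real mode would include a nonce too. */
    __m128i b = _mm_set_epi64x(0, (long long)ctr);

    b = _mm_xor_si128(b, rk[0]);
    for (int i = 1; i < 10; i++)
        b = _mm_aesenc_si128(b, rk[i]);
    return _mm_aesenclast_si128(b, rk[10]);
}

int main(void)
{
    uint8_t key[16] = { 0 };    /* stand-in key; would come from the RNG */
    __m128i rk[11];
    uint32_t out[4];

    aes128_key_schedule(key, rk);
    for (uint64_t ctr = 0; ctr < 2; ctr++) {
        __m128i ks = aes128_ctr_block(rk, ctr);
        memcpy(out, &ks, sizeof(out));
        printf("%08x %08x %08x %08x\n", out[0], out[1], out[2], out[3]);
    }
    return 0;
}

Whether the cache-cold behavior of this actually beats ChaCha20 is exactly the kind of thing that would need measuring.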
On Fri, Aug 7, 2020 at 12:33 PM Andy Lutomirski luto@amacapital.net wrote:
> No one said we have to do only one ChaCha20 block per slow path hit.
Sure, doing more might be better for amortizing the cost.
But you have to be very careful about latency spikes. I would be *really* nervous about doing a whole page at a time, when this is called from routines that literally expect it to take less than 50 cycles.
So I would seriously suggest you look at a much smaller buffer. Maybe not a single block, but definitely not multiple kB either.
Maybe something like 2 cachelines might be ok, but there's a reason the current code only works with 16 bytes (or whatever) and only does simple operations with no looping.
That's why I think you might look at a single double-round ChaCha20 instead. Maybe do it for two blocks - by the time you wrap around, you'll have done more than a full ChaCha20.
That would imnsho be *much* better than doing some big block and having huge latency spikes and flushing a large portion of your L1 when they happen. Nasty nasty behavior.
I really think the whole "we can amortize it with bigger blocks" is complete and utter garbage. It's classic "benchmarketing" crap.
Linus
On Aug 7, 2020, at 12:57 PM, Linus Torvalds torvalds@linux-foundation.org wrote:
> On Fri, Aug 7, 2020 at 12:33 PM Andy Lutomirski luto@amacapital.net wrote:
> > No one said we have to do only one ChaCha20 block per slow path hit.
> Sure, doing more might be better for amortizing the cost.
> But you have to be very careful about latency spikes. I would be *really* nervous about doing a whole page at a time, when this is called from routines that literally expect it to take less than 50 cycles.
> So I would seriously suggest you look at a much smaller buffer. Maybe not a single block, but definitely not multiple kB either.
> Maybe something like 2 cachelines might be ok, but there's a reason the current code only works with 16 bytes (or whatever) and only does simple operations with no looping.
> That's why I think you might look at a single double-round ChaCha20 instead. Maybe do it for two blocks - by the time you wrap around, you'll have done more than a full ChaCha20.
> That would imnsho be *much* better than doing some big block and having huge latency spikes and flushing a large portion of your L1 when they happen. Nasty nasty behavior.
> I really think the whole "we can amortize it with bigger blocks" is complete and utter garbage. It's classic "benchmarketing" crap.
I think this will come down to actual measurements :). If the cost of one block of cache-cold ChaCha20 is 100 cycles of actual computation and 200 cycles of various cache misses, then let’s do more than one block.
I’ll get something working and we’ll see.
On Fri, Aug 7, 2020 at 1:16 PM Andy Lutomirski luto@amacapital.net wrote:
> I think this will come down to actual measurements :).
Numbers talk, bullshit walks.
But:
If the cost of one block of cache-cold ChaCha20 is 100 cycles of actual computation and 200 cycles of various cache misses, then let’s do more than one block.
That's *completely* nonsensical thinking and crazy talk.
If the issue is latency (and for a lot of networking, that literally *is* the issue), your mental model is completely insane.
"Oh, it's so expensive that we should queue *more*" is classic throughput thinking, and it's wrong.
If you have performance problems, you should look really really hard at fixing latency.
Because fixing latency helps throughput. But fixing throughput _hurts_ latency.
It really is that simple. Latency is the much more important thing to optimize.
Nobody cares about "bulk TCP sequence numbers". Sure, you'll find benchmarks for it, because it's a lot easier to benchmark throughput. But what people _care_ about is generally latency.
Linus