On Fri, Aug 07, 2020 at 09:52:14AM -0700, Marc Plumb wrote:
On 2020-08-07 12:03 a.m., Willy Tarreau wrote:
Just to give a heads up on this, here's what I'm having pending regarding MSWS:
struct rnd_state { uint64_t x, w; uint64_t seed; uint64_t noise; };
uint32_t msws32(struct rnd_state *state) { uint64_t x;
x = state->w += state->seed; x += state->x * state->x; x = state->x = (x >> 32) | (x << 32); x -= state->noise++; return x ^ (x >> 32);
}
A few comments:
This is still another non-cryptographic PRNG.
Absolutely. During some discussions regarding the possibility of using CSPRNGs, orders around hundreds of CPU cycles were mentioned for them, which can definitely be a huge waste of precious resources for some workloads, possibly causing the addition of a few percent extra machines in certain environments just to keep the average load under a certain threshold. And the worst there is that such workloads would exactly be the ones absolutely not affected by the theorical attacks regarding predictability :-/
An LFSR can pass PractRand (if you do a tweak to get around the specific linear complexity test for LFSRs).
OK.
On a 64-bit machine it should be fast: 4 adds, 1 multiply, 1 rotate, 1 shift, 1 xor
This will be much slower on 32-bit machines, if that's still a concern
Yep, that's something I'm totally aware of and one reason I wanted to have a public discussion about this. My personal view on this is that a 64-bit multiply will always be cheaper than a crypto operation and that the environments where picking a source port or accepting a connection matters that much are not those running on such low-end machines, so that's very likely an acceptable tradeoff.
As long as the noise is the output of a CPRNG, this doesn't hurt the security of dev/dandom.
The noise is more like effective 32-bits since you're xoring the low and high half of the noise together (ignoring the minor details of carry bits).
Definitely. The purpose of this noise in fact is more to complicate the reconstruction of the internal state in case a large number of samples is used, precisely due to these carry bits that propagate solely depending on the noise value itself, and the uncertainty about when it changes and from what to what.
Which means that it's 2^32 effort to brute force this (which Amit called "no biggie for modern machines"). If the noise is the raw sample data with only a few bits of entropy, then it's even easier to brute force.
Don't you forget to multiply by another 2^32 for X being folded onto itself ? Because you have 2^32 possible values of X which will give you a single 32-bit output value for a given noise value.
Given the uses of this, I think we really should look into a CPRNG for this and then completely reseed it periodically. The problem is finding one that's fast enough.
That was precisely the problem that drove me to propose something like this. And all these cycles wasted everywhere just to protect a 16-bit source port on a mostly idle machine seem quite overkill to me.
Is there a hard instruction budget for this, or it is just "fast enough to not hurt the network benchmarks" (i.e. if Dave Miller screams)?
It's not just Davem. I too am concerned about wasting CPU cycles in fast path especially in the network code. A few half-percent gains are hardly won once in a while in this area and in some infrastructures they matter. Not much but they do.
Willy