Hi Marc,
On Tue, Aug 04, 2020 at 05:52:36PM -0700, Marc Plumb wrote:
Seeding two PRNGs with the same entropy causes two problems. The minor one is that you're double counting entropy. The major one is that anyone who can determine the state of one PRNG can determine the state of the other.
The net_rand_state PRNG is effectively a 113-bit LFSR, so anyone who can see 113 bits of its output can determine the complete internal state.
The output of the net_rand_state PRNG is used to determine how data is sent to the network, so the output is effectively broadcast to anyone watching network traffic. Therefore anyone watching the network traffic can determine the seed data being fed to the net_rand_state PRNG.
The problem this patch is trying to work around is that the reporter (Amit) was able to determine the entire net_rand_state after observing a certain number of packets due to this trivial LFSR and the fact that its internal state between two reseedings only depends on the number of calls to read it. (please note that regarding this point I'll propose a patch to replace that PRNG to stop directly exposing the internal state to the network).
If you look closer at the patch, you'll see that in one interrupt the patch only uses any 32 out of the 128 bits of fast_pool to update only 32 bits of the net_rand_state. As such, the sequence observed on the network also depends on the remaining bits of net_rand_state, while the 96 other bits of the fast_pool are not exposed there.
Since this is the same seed data being fed to get_random_bytes, it allows an attacker to determine the state, and therefore the output, of /dev/random. I sincerely hope that this was not the intended goal. :)
Not only was this obviously not the goal, but I'd be particularly interested in seeing this reality demonstrated, considering that the whole 128 bits of fast_pool together count as a single bit of entropy, and that as such, even if you were able to figure the value of the 32 bits leaked to net_rand_state, you'd still have to guess the 96 other bits for each single entropy bit :-/
Regards, Willy
On Wed, Aug 05, 2020 at 04:49:41AM +0200, Willy Tarreau wrote:
Not only was this obviously not the goal, but I'd be particularly interested in seeing this reality demonstrated, considering that the whole 128 bits of fast_pool together count as a single bit of entropy, and that as such, even if you were able to figure the value of the 32 bits leaked to net_rand_state, you'd still have to guess the 96 other bits for each single entropy bit :-/
Not only that, you'd have to figure out which 32 bits in the fast_pool actually got leaked to the net_rand_state.
I agree with Willy that I'd love to see an exploit since it would probably give a lot of insights. Maybe at a Crypto rump session once it's safe to have those sorts of things again. :-)
That being said, it certainly is a certificational / theoretical weakness, and if the bright boys and girls at Fort Meade did figure out a way to exploit this, they are very much unlikely to share it at an open Crypto conference. So replacing the LFSR-based PRNG with something stronger which didn't release any bits from the fast_pool would certainly be desirable, and I look forward to seeing what Willy has in mind.
Cheers,
- Ted
Hi Ted,
On 2020-08-05 8:34 a.m., tytso@mit.edu wrote:
On Wed, Aug 05, 2020 at 04:49:41AM +0200, Willy Tarreau wrote:
Not only was this obviously not the goal, but I'd be particularly interested in seeing this reality demonstrated, considering that the whole 128 bits of fast_pool together count as a single bit of entropy, and that as such, even if you were able to figure the value of the 32 bits leaked to net_rand_state, you'd still have to guess the 96 other bits for each single entropy bit :-/
Not only that, you'd have to figure out which 32 bits in the fast_pool actually got leaked to the net_rand_state.
Those are 2 bits which are already inputs to the fast_pool, so it doesn't even make a brute-force attack any more difficult.
I agree with Willy that I'd love to see an exploit since it would probably give a lot of insights. Maybe at a Crypto rump session once it's safe to have those sorts of things again. :-)
Just because you or I don't have a working exploit doesn't mean that someone else isn't more clever. It pays to be paranoid about cryptographic primitives and there is nothing more important than the entropy pool.
So replacing the LFSR-based PRNG with something stronger which didn't release any bits from the fast_pool would certainly be desirable, and I look forward to seeing what Willy has in mind.
Isn't get_random_u32 the function you wrote to do that? If this needs to be cryptographically secure, that's an existing option that's safe.
The fundamental question is: why is this attack on net_rand_state a problem? It's Working as Designed. Why is it a major enough problem to risk harming cryptographically important functions?
Do you remember how you resisted making /dev/urandom fast for large reads for a long time, to punish stupid uses of the interface? In this case, anyone who is using net_rand_state assuming it is a CPRNG should stop doing that. Don't enable stupidity in the kernel.
This whole thing is making the fundamental mistake of all amateur cryptographers of trying to create your own cryptographic primitive. You're trying to invent a secure stream cipher. Either don't try to make net_rand_state secure, or use a known secure primitive.
Thanks,
Marc
Hi Marc,
On Wed, Aug 05, 2020 at 09:06:40AM -0700, Marc Plumb wrote:
Just because you or I don't have a working exploit doesn't mean that someone else isn't more clever.
I agree with the principle, but the same could be said of many things, including for example our shared inability to factor large numbers. But for sure we do need to be careful, and actually picking only some limited parts of the fast pool (which are only used to update the input pool and are only made of low-difficulty stuff like instruction pointers, jiffies and TSC values) is probably not going to disclose an extremely well guarded secret.
The fundamental question is: why is this attack on net_rand_state a problem? It's Working as Designed. Why is it a major enough problem to risk harming cryptographically important functions?
It's not *that* major an issue (in my personal opinion) but the current net_rand_state is easy enough to guess so that an observer may reduce the difficulty to build certain attacks (using known source ports for example). The goal of this change (and the one in update_process_times()) is to disturb the net_rand_state a little bit so that external observations turn from "this must be that" to "this may be this or maybe that", which is sufficient to limit the ability to reliably guess a state and reduce the cost of an attack.
Another approach involving the replacement of the algorithm was considered but we were working with -stable in mind, trying to limit the backporting difficulty (and it revealed a circular dependency nightmare that had been sleeping there for years), and making the changes easier to check (which is precisely what you're doing).
Do you remember how you resisted making /dev/urandom fast for large reads for a long time, to punish stupid uses of the interface? In this case, anyone who is using net_rand_state assuming it is a CPRNG should stop doing that. Don't enable stupidity in the kernel.
This whole thing is making the fundamental mistake of all amateur cryptographers of trying to create your own cryptographic primitive. You're trying to invent a secure stream cipher. Either don't try to make net_rand_state secure, or use a known secure primitive.
We're not trying to invent a stream cipher or anything like it, just to make use of a few bits that are never exposed as-is to internal or external states, to slightly disturb another state that otherwise only changes once a minute, so that there's no longer a 100% chance of guessing a 16-bit port after seeing a few packets. I mean, I'm pretty sure that even stealing only three or four bits from there would be quite enough to defeat the attack, given that Amit only recovers a few bits per packet.
For me the right long-term solution will be to replace the easily guessable LFSR. But given the build breakage we got by just adding one include, I can only guess what we'll see when trying to do more in this area :-/
Regards, Willy
Willy,
On 2020-08-05 12:38 p.m., Willy Tarreau wrote:
It's not *that* major an issue (in my personal opinion) but the current net_rand_state is easy enough to guess so that an observer may reduce the difficulty to build certain attacks (using known source ports for example). The goal of this change (and the one in update_process_times()) is to disturb the net_rand_state a little bit so that external observations turn from "this must be that" to "this may be this or maybe that", which is sufficient to limit the ability to reliably guess a state and reduce the cost of an attack.
There is nothing wrong with perturbing net_rand_state, the sin is doing it with the raw entropy that is also the seed for your CPRNG. Use the output of a CPRNG to perturb the pool all you want, but don't do things that bit by bit reveal the entropy that is being fed into the CPRNG.
Another approach involving the replacement of the algorithm was considered but we were working with -stable in mind, trying to limit the backporting difficulty (and it revealed a circular dependency nightmare that had been sleeping there for years), and making the changes easier to check (which is precisely what you're doing).
Really? You can replace the LFSR and only change lib/random32.c. That might fix the problem without the need to use the raw fast_pool data for seed material. As Linus said, speed is a concern, but SFC32 is faster than the existing Tausworthe generator, and it's a drop-in replacement with the same state size if that makes your life easier. If you're willing to expand the state there are even better options (like SFC64 or some of the chaotic generators like Jenkins' Frog).
We're not trying to invent a stream cipher or anything like it, just to make use of a few bits that are never exposed as-is to internal or external states, to slightly disturb another state that otherwise only changes once a minute, so that there's no longer a 100% chance of guessing a 16-bit port after seeing a few packets. I mean, I'm pretty sure that even stealing only three or four bits from there would be quite enough to defeat the attack, given that Amit only recovers a few bits per packet.
If Amit's attack can recover that much per packet (in practice, not just in theory) and there is even one packet per interrupt, then it's a major problem. There are at most 2 bits of entropy added per call to add_interrupt_randomness, so if you're leaking "a few bits per packet" then that's everything. Over 64 interrupts you've lost the 128 bits of entropy that the fast_pool has spilled into the input_pool.
Marc
On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote:
There is nothing wrong with perturbing net_rand_state, the sin is doing it with the raw entropy that is also the seed for your CPRNG. Use the output of a CPRNG to perturb the pool all you want, but don't do things that bit by bit reveal the entropy that is being fed into the CPRNG.
This is interesting because I think some of us considered it exactly the other way around, i.e. we're not copying exact bits but just taking a pseudo-random part of such bits at one point in time, to serve as an increment among other ones. And given that these bits were collected over time from not very secret sources, they appeared to be of lower risk than the output.
I mean, if we reimplemented something in parallel just mixing the IRQ return pointer and TSC, some people could possibly say "is this really strong enough?" but it wouldn't seem very shocking in terms of disclosure. But if by doing so we ended up by accident reproducing the same contents as the fast_pool, it could be problematic.
Would you think that using only the input info used to update the fast_pool would be cleaner? I mean, instead of:
    fast_pool->pool[0] ^= cycles ^ j_high ^ irq;
    fast_pool->pool[1] ^= now ^ c_high;
    ip = regs ? instruction_pointer(regs) : _RET_IP_;
    fast_pool->pool[2] ^= ip;
    fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 : get_reg(fast_pool, regs);
we'd do:
    x0 = cycles ^ j_high ^ irq;
    x1 = now ^ c_high;
    x2 = regs ? instruction_pointer(regs) : _RET_IP_;
    x3 = (sizeof(x2) > 4) ? x2 >> 32 : get_reg(fast_pool, regs);

    fast_pool->pool[0] ^= x0;
    fast_pool->pool[1] ^= x1;
    fast_pool->pool[2] ^= x2;
    fast_pool->pool[3] ^= x3;

    this_cpu_add(net_rand_state.s1, x0^x1^x2^x3);
Because that's something easy to do and providing the same level of perturbation to net_rand_state.
Another approach involving the replacement of the algorithm was considered but we were working with -stable in mind, trying to limit the backporting difficulty (and it revealed a circular dependency nightmare that had been sleeping there for years), and making the changes easier to check (which is precisely what you're doing).
Really? You can replace the LFSR and only change lib/random32.c. That might fix the problem without the need to use the raw fast_pool data for seed material.
Definitely. I had been working on a variant of MSWS (https://arxiv.org/pdf/1704.00358.pdf) that I discussed a little bit with Amit, but I haven't published it yet since it needs to be cleaned up, and all the test patterns in random32.c need to be replaced. And when starting to rebase it, I figured that updating it from the interrupt handler didn't really look like it brought anything. I mean that when you're starting to wonder what to update "to make it better", it likely means that you don't need it.
As Linus said, speed is a concern, but SFC32 is faster than the existing Tausworthe generator, and it's a drop-in replacement with the same state size if that makes your life easier. If you're willing to expand the state there are even better options (like SFC64 or some of the chaotic generators like Jenkins' Frog).
I didn't know about SFC32; it looks like a variation on the large family of xorshift generators, which is thus probably quite suitable for this task as well. Having used xoroshiro128** myself in another project, I found it overkill for this task compared to MSWS, but I definitely agree that any of them is better suited to the task than the current generator.
We're not trying to invent a stream cipher or anything like it, just to make use of a few bits that are never exposed as-is to internal or external states, to slightly disturb another state that otherwise only changes once a minute, so that there's no longer a 100% chance of guessing a 16-bit port after seeing a few packets. I mean, I'm pretty sure that even stealing only three or four bits from there would be quite enough to defeat the attack, given that Amit only recovers a few bits per packet.
If Amit's attack can recover that much per packet (in practice, not just in theory) and there is even one packet per interrupt, then it's a major problem. There are at most 2 bits of entropy added per call to add_interrupt_randomness, so if you're leaking "a few bits per packet" then that's everything. Over 64 interrupts you've lost the 128 bits of entropy that the fast_pool has spilled into the input_pool.
But how do you *figure out* the value of these bits and their position in the fast_pool? I mean that the attack relies on net_rand_state not being perturbed. Let's assume you were able to brute-force them in linear time by drawing the list of possible candidates for the 32-bit increment on each packet, ending up with 64 sets of candidates. You'll probably have quite a number of candidates there, since you'll also need to take into account the CPU the packet ends up on and the calls to update_process_times() on each jiffy. For each of these candidates you don't precisely know what part of the fast_pool was used, so you're at 4^64 combinations, and even if you can guess which ones they are, they come from different generations of the fast_pool, which are iterated over by other activities.
Don't get me wrong, I'm not trying to defend the solution or anything, I find it more like a reasonable short term plaster to at least address lab-style attacks on almost idle machines. But outside of the lab, when machines are doing real work, I agree with Linus that both the attack and the risks mentioned above fly into pieces because there are a lot more possibilities at each step of the operations that have to be taken into account. Now if you think that doing some small changes like proposed above just to better protect the entropy would provide significant help, feel free to suggest so.
Thanks, Willy
Willy,
On 2020-08-05 11:30 p.m., Willy Tarreau wrote:
On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote:
There is nothing wrong with perturbing net_rand_state, the sin is doing it with the raw entropy that is also the seed for your CPRNG. Use the output of a CPRNG to perturb the pool all you want, but don't do things that bit by bit reveal the entropy that is being fed into the CPRNG.
This is interesting because I think some of us considered it exactly the other way around, i.e. we're not copying exact bits but just taking a pseudo-random part of such bits at one point in time, to serve as an increment among other ones. And given that these bits were collected over time from not very secret sources, they appeared to be of lower risk than the output.
No. The output of a CPRNG can't be used to determine the internal state. The input can. The input entropy is the one thing that cannot be produced by a deterministic computer, so they are the crown jewels of this. It's much much safer to use the output.
I mean, if we reimplemented something in parallel just mixing the IRQ return pointer and TSC, some people could possibly say "is this really strong enough?" but it wouldn't seem very shocking in terms of disclosure. But if by doing so we ended up by accident reproducing the same contents as the fast_pool, it could be problematic.
Would you think that using only the input info used to update the fast_pool would be cleaner ? I mean, instead of :
    fast_pool->pool[0] ^= cycles ^ j_high ^ irq;
    fast_pool->pool[1] ^= now ^ c_high;
    ip = regs ? instruction_pointer(regs) : _RET_IP_;
    fast_pool->pool[2] ^= ip;
    fast_pool->pool[3] ^= (sizeof(ip) > 4) ? ip >> 32 : get_reg(fast_pool, regs);
we'd do:
    x0 = cycles ^ j_high ^ irq;
    x1 = now ^ c_high;
    x2 = regs ? instruction_pointer(regs) : _RET_IP_;
    x3 = (sizeof(x2) > 4) ? x2 >> 32 : get_reg(fast_pool, regs);

    fast_pool->pool[0] ^= x0;
    fast_pool->pool[1] ^= x1;
    fast_pool->pool[2] ^= x2;
    fast_pool->pool[3] ^= x3;

    this_cpu_add(net_rand_state.s1, x0^x1^x2^x3);
No. That's just as bad. There are two major problems:
1. It takes the entropy and sends it to the outside world without any strong crypto between the seed and the output. Reversing this isn't trivial, but it also isn't provably difficult.
2. It adds small amounts of entropy at a time and exposes them to the outside world. No amount of crypto can make this safe (google "catastrophic reseeding"). If an attacker can guess the time to within 1ms, then on a 4GHz CPU that's only about 22 bits of uncertainty, so it's possible to brute-force the input. The details of which part of the fast_pool gets used are irrelevant, since those are determined by the same input, so they add no security against this type of brute-force attack. The only other unknown is the initial TSC offset, but if that were sufficient we wouldn't need the reseeding at all.
I didn't know about SFC32; it looks like a variation on the large family of xorshift generators, which is thus probably quite suitable for this task as well. Having used xoroshiro128** myself in another project, I found it overkill for this task compared to MSWS, but I definitely agree that any of them is better suited to the task than the current generator.
It's actually a chaotic generator (not a linear one like an xorshift generator), which gives it weaker period guarantees but makes it more difficult to reverse; a counter is added to guarantee a minimum period.
I'll trust Amit that SFC32 isn't strong enough and look at other options -- I just suggested it as better and faster than the existing generator, with the same state size. Maybe a larger state is needed.
Thanks,
Marc
On Thu, Aug 06, 2020 at 10:18:55AM -0700, Marc Plumb wrote:
Willy,
On 2020-08-05 11:30 p.m., Willy Tarreau wrote:
On Wed, Aug 05, 2020 at 03:21:11PM -0700, Marc Plumb wrote:
There is nothing wrong with perturbing net_rand_state, the sin is doing it with the raw entropy that is also the seed for your CPRNG. Use the output of a CPRNG to perturb the pool all you want, but don't do things that bit by bit reveal the entropy that is being fed into the CPRNG.
This is interesting because I think some of us considered it exactly the other way around, i.e. we're not copying exact bits but just taking a pseudo-random part of such bits at one point in time, to serve as an increment among other ones. And given that these bits were collected over time from not very secret sources, they appeared to be of lower risk than the output.
No. The output of a CPRNG can't be used to determine the internal state. The input can. The input entropy is the one thing that cannot be produced by a deterministic computer, so they are the crown jewels of this. It's much much safer to use the output.
OK, noted.
I didn't know about SFC32; it looks like a variation on the large family of xorshift generators, which is thus probably quite suitable for this task as well. Having used xoroshiro128** myself in another project, I found it overkill for this task compared to MSWS, but I definitely agree that any of them is better suited to the task than the current generator.
It's actually a chaotic generator (not a linear one like an xorshift generator), which gives it weaker period guarantees but makes it more difficult to reverse; a counter is added to guarantee a minimum period.
I'll trust Amit that SFC32 isn't strong enough and look at other options -- I just suggested it as better and faster than the existing generator, with the same state size. Maybe a larger state is needed.
Just to give a heads up on this, here's what I'm having pending regarding MSWS:
    struct rnd_state {
        uint64_t x, w;
        uint64_t seed;
        uint64_t noise;
    };

    uint32_t msws32(struct rnd_state *state)
    {
        uint64_t x;

        x  = state->w += state->seed;
        x += state->x * state->x;
        x  = state->x = (x >> 32) | (x << 32);
        x -= state->noise++;
        return x ^ (x >> 32);
    }
It passes PractRand without any warning after 1 TB of data:
rng=RNG_stdin, seed=unknown
length= 512 megabytes (2^29 bytes), time= 2.0 seconds
  no anomalies in 229 test result(s)

rng=RNG_stdin, seed=unknown
length= 1 gigabyte (2^30 bytes), time= 4.3 seconds
  no anomalies in 248 test result(s)

rng=RNG_stdin, seed=unknown
length= 2 gigabytes (2^31 bytes), time= 8.3 seconds
  no anomalies in 266 test result(s)

rng=RNG_stdin, seed=unknown
length= 4 gigabytes (2^32 bytes), time= 15.8 seconds
  no anomalies in 282 test result(s)

rng=RNG_stdin, seed=unknown
length= 8 gigabytes (2^33 bytes), time= 31.3 seconds
  no anomalies in 299 test result(s)

rng=RNG_stdin, seed=unknown
length= 16 gigabytes (2^34 bytes), time= 61.9 seconds
  no anomalies in 315 test result(s)

rng=RNG_stdin, seed=unknown
length= 32 gigabytes (2^35 bytes), time= 119 seconds
  no anomalies in 328 test result(s)

rng=RNG_stdin, seed=unknown
length= 64 gigabytes (2^36 bytes), time= 242 seconds
  no anomalies in 344 test result(s)

rng=RNG_stdin, seed=unknown
length= 128 gigabytes (2^37 bytes), time= 483 seconds
  no anomalies in 359 test result(s)

rng=RNG_stdin, seed=unknown
length= 256 gigabytes (2^38 bytes), time= 940 seconds
  no anomalies in 372 test result(s)

rng=RNG_stdin, seed=unknown
length= 512 gigabytes (2^39 bytes), time= 1906 seconds
  no anomalies in 387 test result(s)

rng=RNG_stdin, seed=unknown
length= 1 terabyte (2^40 bytes), time= 3826 seconds
  no anomalies in 401 test result(s)
The two modifications compared to the original msws are:
- mix bits on output so that we don't reveal the internal state upon each call ;
- combination of the output with an independent noise variable, whose purpose was to be updated upon IRQ and/or CPU usage and/or invocations. But on this point, while implementing it I figured that updating it on each invocation already provided the frequent updates we were missing in Tausworthe and that had required the interrupt updates. I'd definitely update it in update_process_times() as well, so that it's not reduced to a pure counter, but the results, speed and simplicity look encouraging.
I'll try to work on finishing the patch proposal this weekend.
Regards, Willy
On 2020-08-07 12:03 a.m., Willy Tarreau wrote:
Just to give a heads up on this, here's what I'm having pending regarding MSWS:
    struct rnd_state {
        uint64_t x, w;
        uint64_t seed;
        uint64_t noise;
    };

    uint32_t msws32(struct rnd_state *state)
    {
        uint64_t x;

        x  = state->w += state->seed;
        x += state->x * state->x;
        x  = state->x = (x >> 32) | (x << 32);
        x -= state->noise++;
        return x ^ (x >> 32);
    }
A few comments:
- This is still another non-cryptographic PRNG. An LFSR can pass PractRand (if you do a tweak to get around the specific linear-complexity test for LFSRs).
- On a 64-bit machine it should be fast: 4 adds, 1 multiply, 1 rotate, 1 shift, 1 xor.
- This will be much slower on 32-bit machines, if that's still a concern.
- As long as the noise is the output of a CPRNG, this doesn't hurt the security of /dev/random.
- The noise is more like an effective 32 bits, since you're xoring the low and high halves of the noise together (ignoring the minor details of carry bits). Which means that it's a 2^32 effort to brute force this (which Amit called "no biggie for modern machines"). If the noise is the raw sample data with only a few bits of entropy, then it's even easier to brute force.
Given the uses of this, I think we really should look into a CPRNG for this and then completely reseed it periodically. The problem is finding one that's fast enough. Is there a hard instruction budget for this, or it is just "fast enough to not hurt the network benchmarks" (i.e. if Dave Miller screams)?
Marc
On Fri, Aug 07, 2020 at 09:52:14AM -0700, Marc Plumb wrote:
On 2020-08-07 12:03 a.m., Willy Tarreau wrote:
Just to give a heads up on this, here's what I'm having pending regarding MSWS:
    struct rnd_state {
        uint64_t x, w;
        uint64_t seed;
        uint64_t noise;
    };

    uint32_t msws32(struct rnd_state *state)
    {
        uint64_t x;

        x  = state->w += state->seed;
        x += state->x * state->x;
        x  = state->x = (x >> 32) | (x << 32);
        x -= state->noise++;
        return x ^ (x >> 32);
    }
A few comments:
This is still another non-cryptographic PRNG.
Absolutely. In some discussions about the possibility of using CSPRNGs, figures on the order of hundreds of CPU cycles were mentioned for them, which can definitely be a huge waste of precious resources for some workloads, possibly requiring a few percent of extra machines in certain environments just to keep the average load under a given threshold. And the worst part is that such workloads would be exactly the ones not affected at all by the theoretical attacks on predictability :-/
An LFSR can pass PractRand (if you do a tweak to get around the specific linear complexity test for LFSRs).
OK.
On a 64-bit machine it should be fast: 4 adds, 1 multiply, 1 rotate, 1 shift, 1 xor
This will be much slower on 32-bit machines, if that's still a concern
Yep, that's something I'm totally aware of and one reason I wanted to have a public discussion about this. My personal view on this is that a 64-bit multiply will always be cheaper than a crypto operation and that the environments where picking a source port or accepting a connection matters that much are not those running on such low-end machines, so that's very likely an acceptable tradeoff.
As long as the noise is the output of a CPRNG, this doesn't hurt the security of /dev/random.
The noise is more like an effective 32 bits, since you're xoring the low and high halves of the noise together (ignoring the minor details of carry bits).
Definitely. The purpose of this noise is in fact more to complicate the reconstruction of the internal state when a large number of samples is observed, precisely due to these carry bits, which propagate depending solely on the noise value itself, and due to the uncertainty about when it changes and from what to what.
Which means that it's 2^32 effort to brute force this (which Amit called "no biggie for modern machines"). If the noise is the raw sample data with only a few bits of entropy, then it's even easier to brute force.
Aren't you forgetting to multiply by another 2^32 for X being folded onto itself? Because you have 2^32 possible values of X which will give you a single 32-bit output value for a given noise value.
Given the uses of this, I think we really should look into a CPRNG for this and then completely reseed it periodically. The problem is finding one that's fast enough.
That was precisely the problem that drove me to propose something like this. And all these cycles wasted everywhere just to protect a 16-bit source port on a mostly idle machine seem quite overkill to me.
Is there a hard instruction budget for this, or is it just "fast enough to not hurt the network benchmarks" (i.e. if Dave Miller screams)?
It's not just Davem. I too am concerned about wasting CPU cycles in the fast path, especially in the network code. Half-percent gains are hard-won and only come once in a while in this area, and in some infrastructures they matter. Not much, but they do.
Willy
On 2020-08-07 10:43 a.m., Willy Tarreau wrote:
Which means that it's 2^32 effort to brute force this (which Amit called "no biggie for modern machines"). If the noise is the raw sample data with only a few bits of entropy, then it's even easier to brute force.
Don't you forget to multiply by another 2^32 for X being folded onto itself ? Because you have 2^32 possible values of X which will give you a single 32-bit output value for a given noise value.
If I can figure the state out once, then the only new input is the noise, so that's the only part I have to brute force. Throwing the noise in makes it more difficult to get that state once, but once I have it then this type of reseeding doesn't help.
Is there a hard instruction budget for this, or is it just "fast enough to not hurt the network benchmarks" (i.e. if Dave Miller screams)?
It's not just Davem. I too am concerned about wasting CPU cycles in fast path especially in the network code. A few half-percent gains are hardly won once in a while in this area and in some infrastructures they matter. Not much but they do.
That's why I was asking. I don't have the same experience as you about what's acceptable. I think it might be possible to do a decent CPRNG (one that's at least had some cryptanalysis) in ~20 instructions per word, but if that's not fast enough then I'll think about other options.
Marc
On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote:
On 2020-08-07 10:43 a.m., Willy Tarreau wrote:
Which means that it's 2^32 effort to brute force this (which Amit called "no biggie for modern machines"). If the noise is the raw sample data with only a few bits of entropy, then it's even easier to brute force.
Don't you forget to multiply by another 2^32 for X being folded onto itself ? Because you have 2^32 possible values of X which will give you a single 32-bit output value for a given noise value.
If I can figure the state out once,
Yes, but how do you take that as a given? This state never appears without its noise counterpart, so taking as a prerequisite that you may guess one separately obviously implies that you then just have to deduce the other, but the whole point of mixing is precisely that we do not expose the individual parts.
This way of thinking is often what results in extreme solutions being designed, far away from the reality of the field of application, with unacceptable costs that make people turn to other solutions. Do you think it makes me happy to see people waste their time reimplementing alternate userland TCP stacks that are supposedly "way faster" by getting rid of all the (to them) useless stuff that was forced on them at the cost of performance? And it makes me even less happy when they ask me why I'm not spending more time trying to adopt them. The reality is that this time would be better spent optimizing our stack to make sure that costs are added where they are relevant, and not just to make sure that when we align 7 conditions starting with "imagine that I could guess that", the 8th could be guessed as well, when none of these can really be guessed outside of a lab :-/
then the only new input is the noise, so that's the only part I have to brute force. Throwing the noise in makes it more difficult to get that state once, but once I have it then this type of reseeding doesn't help.
I think it might be possible to do a decent CPRNG (one that's at least had some cryptanalysis of it) with ~20 instructions per word, but if that's not fast enough then I'll think about other options.
I think that around 20 instructions for a hash would definitely be nice (but please be aware that we're speaking about RISC-like instructions, not SIMD instructions). And also please be careful not to count only amortized performance, which is only good for showing nice openssl benchmarks: if a call costs 1280 instructions and only amortizes down to 20 instructions per 32-bit word over large blocks, that's not the same thing at all for a caller who needs a single word!
Regards, Willy
Willy,
On 2020-08-07 3:19 p.m., Willy Tarreau wrote:
On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote:
If I can figure the state out once,
Yes but how do you take that as granted ? This state doesn't appear without its noise counterpart,
Amit has shown attacks that can deduce the full internal state from 4-5 packets with a weak PRNG. If the noise is added less often than that, an attacker can figure out the entire state, at which point the partial reseeding doesn't help. If the noise is added more often than that, and it's raw timing events, then it's only adding a few bits of entropy, so it's easy to guess (and it weakens /dev/random). If the noise is added more often and comes from the output of a CPRNG, then we have all the performance/latency problems of the CPRNG itself, so we might as well use it directly.
I think it might be possible to do a decent CPRNG (one that's at least had some cryptanalysis of it) with ~20 instructions per word, but if that's not fast enough then I'll think about other options.
I think that around 20 instructions for a hash would definitely be nice (but please be aware that we're speaking about RISC-like instructions, not SIMD instructions). And also please be careful not to count only amortized performance, which is only good for showing nice openssl benchmarks: if a call costs 1280 instructions and only amortizes down to 20 instructions per 32-bit word over large blocks, that's not the same thing at all for a caller who needs a single word!
Understood.
Marc
On Fri, Aug 07, 2020 at 03:45:48PM -0700, Marc Plumb wrote:
Willy,
On 2020-08-07 3:19 p.m., Willy Tarreau wrote:
On Fri, Aug 07, 2020 at 12:59:48PM -0700, Marc Plumb wrote:
If I can figure the state out once,
Yes but how do you take that as granted ? This state doesn't appear without its noise counterpart,
Amit has shown attacks that can deduce the full internal state from 4-5 packets with a weak PRNG. If the noise is added less often than that, an attacker can figure out the entire state, at which point the partial reseeding doesn't help. If the noise is added more often than that, and it's raw timing events, then it's only adding a few bits of entropy, so it's easy to guess (and it weakens /dev/random). If the noise is added more often and comes from the output of a CPRNG, then we have all the performance/latency problems of the CPRNG itself, so we might as well use it directly.
His attack is based on the fact that the internal state bits leak as-is, so it is possible to collect them and perform the inverse operation to reconstruct the input.
Here the output is made by mixing two input bits with two noise bits to produce one single output bit. So for each output bit you see, you don't have just one possible input bit, but 16 possible combinations. That's 128 bits of internal changing state that once combined result in 32 bits. For each 32-bit output you still have 2^96 possible internal (x, noise) states producing it. And that's without counting the 64-bit seed that you still don't know but could deduce from two guessed 128-bit states (assuming that can be brute-forced at all, of course).
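Willy's bit-level count can be checked by exhaustive enumeration. The 4-to-1 mix used here is a hypothetical balanced function (plain XOR of the four bits), not the kernel's actual operation, but any balanced mix gives the same count:

```c
/* Count the 4-bit (state, noise) combinations consistent with one
 * observed output bit, using a hypothetical balanced mixing function.
 * Two state bits and two noise bits give 2^4 = 16 combinations total. */
static unsigned consistent_combos(int out)
{
    unsigned n = 0;
    for (unsigned s = 0; s < 16; s++) {
        int x0 = s & 1,        x1 = (s >> 1) & 1;   /* state bits  */
        int n0 = (s >> 2) & 1, n1 = (s >> 3) & 1;   /* noise bits  */
        if ((x0 ^ x1 ^ n0 ^ n1) == out)             /* hypothetical mix */
            n++;
    }
    return n;
}
```

Half of the 16 combinations (8) match each observed output bit, so seeing 32 output bits still leaves 2^96 of the 2^128 internal combinations in play, which is the figure above.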
For sure there are plenty of number-theoretic techniques allowing you to significantly reduce the space you need to brute-force, but it will definitely remain above the 2^32 attempts per iteration which is the floor you'd have without the noise part, while the whole internal state gets reseeded every minute anyway.
I mean, I'm not trying to sell this thing, I'm just trying to defend using a reasonable tool for a reasonable protection level. And yes, probably 15 years from now someone will say "hey look, I can brute-force this thing in less than a minute on my 1024-core 39th-gen Core i7 machine with 2^60 operations per second, why don't we use our CPU's native 10-picosecond AES1024 instruction instead?" and we'll say "well, it's an old story and it's probably about time to change it again".
I don't see anything wrong with evolving this way, matching concrete needs more than pure theory.
Regards, Willy
On Wed, Aug 05, 2020 at 09:06:40AM -0700, Marc Plumb wrote:
Isn't get_random_u32 the function you wrote to do that? If this needs to be cryptographically secure, that's an existing option that's safe.
The fundamental question is: Why is this attack on net_rand_state a problem? It's Working as Designed. Why is it a major enough problem to risk harming cryptographically important functions?
I haven't looked at the users of net_rand_state, but historically, the networking subsystem has expressed a (perceived?) need for *very* fast mostly-random-but-it-doesn't-have-to-be-perfect-our-benchmarks-are-way-more-important numbers. As in, if there are extra cache line misses, our benchmarks would suffer and that's not acceptable.
One of the problems here is that it's not sufficient for the average case to be fast; once in every N operations, we need to do something that requires Real Crypto, and that Nth time there would be an extra lag, which would be the end of the world (at least as far as networking benchmarks are concerned, anyway). So in other words, it's not enough for the average time to run get_random_u32() to be fast; they care about the 95th or 99th percentile latency of get_random_u32() as well.
An example of this would be TCP sequence number generation; it's not *really* something that needs to be secure, and if we rekey the RNG every 5 minutes and the best-case attack takes at least, say, an hour, the worst the attacker can do is carry out a man-in-the-middle attack without being physically in between the source and the destination --- well, if you *really* cared about security the TCP connection would be protected using TLS anyway. See RFC 1948 (later updated by RFC 6528) for an argument along these lines.
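The RFC 6528 construction Ted refers to computes ISN = M + F(localip, localport, remoteip, remoteport, secretkey), where M is a clock-driven counter and F is a keyed hash. A minimal sketch, with a toy multiply-xor stand-in for F (a real stack would use MD5 or SipHash, and these names are illustrative, not kernel APIs):

```c
#include <stdint.h>

struct conn4 { uint32_t laddr, raddr; uint16_t lport, rport; };

/* Toy keyed hash standing in for F; NOT cryptographic. */
static uint32_t toy_f(const struct conn4 *c, uint64_t key)
{
    uint64_t h = key;
    h = (h ^ c->laddr) * 0x9e3779b97f4a7c15ULL;
    h = (h ^ c->raddr) * 0x9e3779b97f4a7c15ULL;
    h = (h ^ (((uint32_t)c->lport << 16) | c->rport)) * 0x9e3779b97f4a7c15ULL;
    return (uint32_t)(h >> 32);
}

/* ISN = M + F(...): the per-connection offset stays fixed for the key's
 * lifetime, while the clock term keeps sequence numbers advancing. */
static uint32_t toy_isn(const struct conn4 *c, uint64_t key, uint32_t clock)
{
    return clock + toy_f(c, key);
}
```

The point of the scheme is that an off-path attacker must guess F's output per 4-tuple, while legitimate endpoints still get monotonically advancing sequence numbers.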
This whole thing is making the fundamental mistake of all amateur cryptographers of trying to create your own cryptographic primitive. You're trying to invent a secure stream cipher. Either don't try to make net_rand_state secure, or use a known secure primitive.
Well, technically it's not supposed to be a secure cryptographic primitive. net_rand_state is used in the call prandom_u32(), so the only supposed guarantee is PSEUDO random.
That being said, a quick "git grep prandom_u32" shows that there are a *huge* number of uses of prandom_u32(), and whether they are all appropriate uses, or whether kernel developers are using it because "I haz a ne3D for spE3d" when in fact it's a security-critical application, is a pretty terrifying question. If we start seeing CVEs filed over inappropriate uses of prandom_u32, to be honest, it won't surprise me.
- Ted
On Aug 5, 2020, at 3:06 PM, tytso@mit.edu wrote:
On Wed, Aug 05, 2020 at 09:06:40AM -0700, Marc Plumb wrote: Isn't get_random_u32 the function you wrote to do that? If this needs to be cryptographically secure, that's an existing option that's safe. The fundamental question is: Why is this attack on net_rand_state a problem? It's Working as Designed. Why is it a major enough problem to risk harming cryptographically important functions?
I haven't looked at the users of net_rand_state, but historically, the networking subsystem has expressed a (perceived?) need for *very* fast mostly-random-but-it-doesn't-have-to-be-perfect-our-benchmarks-are-way-more-important numbers. As in, if there are extra cache line misses, our benchmarks would suffer and that's not acceptable.
One of the problems here is that it's not sufficient for the average case to be fast; once in every N operations, we need to do something that requires Real Crypto, and that Nth time there would be an extra lag, which would be the end of the world (at least as far as networking benchmarks are concerned, anyway).
I respectfully disagree with the supposed network people :). I’m working, slowly, on a patch set to make this genuinely fast.
So in other words, it's not enough for the average time to run get_random_u32() to be fast; they care about the 95th or 99th percentile latency of get_random_u32() as well.
An example of this would be TCP sequence number generation; it's not *really* something that needs to be secure, and if we rekey the RNG every 5 minutes and the best-case attack takes at least, say, an hour, the worst the attacker can do is carry out a man-in-the-middle attack without being physically in between the source and the destination --- well, if you *really* cared about security the TCP connection would be protected using TLS anyway. See RFC 1948 (later updated by RFC 6528) for an argument along these lines.
This whole thing is making the fundamental mistake of all amateur cryptographers of trying to create your own cryptographic primitive. You're trying to invent a secure stream cipher. Either don't try to make net_rand_state secure, or use a known secure primitive.
Well, technically it's not supposed to be a secure cryptographic primitive. net_rand_state is used in the call prandom_u32(), so the only supposed guarantee is PSEUDO random.
That being said, a quick "git grep prandom_u32" shows that there are a *huge* number of uses of prandom_u32(), and whether they are all appropriate uses, or whether kernel developers are using it because "I haz a ne3D for spE3d" when in fact it's a security-critical application, is a pretty terrifying question. If we start seeing CVEs filed over inappropriate uses of prandom_u32, to be honest, it won't surprise me.
- Ted
On 2020-08-05 3:05 p.m., tytso@mit.edu wrote:
Well, technically it's not supposed to be a secure cryptographic primitive. net_rand_state is used in the call prandom_u32(), so the only supposed guarantee is PSEUDO random.
That being said, a quick "git grep prandom_u32" shows that there are a *huge* number of uses of prandom_u32(), and whether they are all appropriate uses, or whether kernel developers are using it because "I haz a ne3D for spE3d" when in fact it's a security-critical application, is a pretty terrifying question. If we start seeing CVEs filed over inappropriate uses of prandom_u32, to be honest, it won't surprise me.
The danger I'm worried about isn't misuse of prandom_u32; that would just mean one function has weak random numbers. I'm worried about the disclosure of the entropy that is the basis for the good random numbers, because that would undermine the security of the people who are using the right functions for their task.
Having said that, auditing all uses of prandom_u32 would be useful, but a different issue.
On Wed, Aug 5, 2020 at 6:07 PM tytso@mit.edu wrote:
That being said, it certainly is a certificational / theoretical weakness
random.c is filled with super suspicious things that are probably only correct by accident, or only correct in practice, but in theory it's just such a mess. Stupid example, if I'm remembering correctly: you fill the sha1 IV with input from rdrand; if rdrand is borked or backdoored or whatever, then the security of sha1 there reduces to SHACAL-1, which isn't totally broken, far from it actually, so we're fine there, but you can't help but look at that and say "ugh." I'll rewrite that eventually. Anyway, having another "certificational weakness", as you put it, that isn't a practical one would be par for the course with random.c
, and if the bright boys and girls at Fort Meade did figure out a way to exploit this, they are very much unlikely to share it at an open Crypto conference. So replacing the LFSR-based PRNG with something stronger which didn't release any bits from the fast_pool would certainly be desirable, and I look forward to seeing what Willy has in mind.
This disaster is partially my fault. I was going to make get_random_u{32,64} fast enough that we could get rid of the fake rng stuff, and then Covid things got weird and I haven't gotten refocused on that yet. Andy started that, and I was supposed to pick up his work and complete it, but I dropped the ball. I kept meaning to get back to that, but I'd get discouraged every time I saw Willy sending more messages about improving the fake rng stuff with more fake rng stuff. But, seems like it's time for me to step on the pedal a bit and get that working. So hopefully we'll be able to get rid of the code in question here, and use a good rng everywhere. I'll send an update on that if I get it working.
Jason
Hi Ted,
On Wed, Aug 05, 2020 at 11:34:32AM -0400, tytso@mit.edu wrote:
That being said, it certainly is a certificational / theoretical weakness, and if the bright boys and girls at Fort Meade did figure out a way to exploit this, they are very much unlikely to share it at an open Crypto conference. So replacing the LFSR-based PRNG with something stronger which didn't release any bits from the fast_pool would certainly be desirable, and I look forward to seeing what Willy has in mind.
I'll post a proposed patch about this shortly, hopefully this weekend (got diverted by work lately :-)). Just to give you a few pointers: it's a small modification of MSWS. It passes the PractRand test suite on 256 GB of data with zero warnings (something that Tausworthe is supposed to fail at).
By default, MSWS *does* leak its internal state, as Amit showed us (and seeing that the paper on it suggests it's safe as-is for crypto use is a bit shocking), but once slightly adjusted it doesn't reveal its state anymore, and that would constitute a much more future-proof solution for quite some time. Tausworthe was created something like 20 years ago, so it's not surprising that it's a bit dated by now, but if we can upgrade once every two decades I guess it's not that bad.
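For readers unfamiliar with MSWS: the baseline algorithm (middle-square Weyl sequence, per Widynski's paper) squares a 64-bit value, adds a Weyl sequence, and returns the rotated "middle" bits. A user-space sketch of that baseline follows; note it returns part of its internal state directly, which is exactly what Willy's adjustment is meant to avoid:

```c
#include <stdint.h>

struct msws_state {
    uint64_t x, w;
    uint64_t s;   /* Weyl increment; must be odd */
};

/* Baseline MSWS as published; Willy's proposal modifies this precisely
 * because the return value here exposes (half of) the internal state. */
static uint32_t msws32(struct msws_state *st)
{
    st->x *= st->x;                          /* middle-square step */
    st->x += (st->w += st->s);               /* Weyl sequence avoids the
                                                classic degeneration to 0 */
    st->x = (st->x >> 32) | (st->x << 32);   /* keep the "middle" bits */
    return (uint32_t)st->x;
}
```

Seeding both x and w to 0 with an odd s is a common starting point; the Weyl term guarantees the squared value never gets stuck.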
Cheers, Willy
Hi Willy,
On 2020-08-04 7:49 p.m., Willy Tarreau wrote:
Hi Marc,
On Tue, Aug 04, 2020 at 05:52:36PM -0700, Marc Plumb wrote:
Seeding two PRNGs with the same entropy causes two problems. The minor one is that you're double counting entropy. The major one is that anyone who can determine the state of one PRNG can determine the state of the other.
The net_rand_state PRNG is effectively a 113 bit LFSR, so anyone who can see any 113 bits of output can determine the complete internal state.
The output of the net_rand_state PRNG is used to determine how data is sent to the network, so the output is effectively broadcast to anyone watching network traffic. Therefore anyone watching the network traffic can determine the seed data being fed to the net_rand_state PRNG.
The problem this patch is trying to work around is that the reporter (Amit) was able to determine the entire net_rand_state after observing a certain number of packets due to this trivial LFSR and the fact that its internal state between two reseedings only depends on the number of calls to read it.
I thought net_rand_state was assumed to be insecure and that anyone could determine the internal state. Isn't this Working as Designed? It's a PRNG, not a CPRNG. If some aspect of security depends on the state remaining secret then this is fundamentally the wrong tool.
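For reference, a user-space sketch of the generator under discussion, modeled on prandom_u32_state() in lib/random32.c (four small Tausworthe LFSRs whose component periods total roughly 2^113, hence "113 bit LFSR"). Constants are copied from my reading of that file and should be double-checked against the source:

```c
#include <stdint.h>

struct rnd_state { uint32_t s1, s2, s3, s4; };

#define TAUSWORTHE(s, a, b, c, d) (((s & c) << d) ^ (((s << a) ^ s) >> b))

/* Every output bit is a linear function of the ~113 state bits, which is
 * why observing enough outputs pins down the entire internal state. */
static uint32_t prandom_u32_sketch(struct rnd_state *st)
{
    st->s1 = TAUSWORTHE(st->s1,  6U, 13U, 4294967294U, 18U);
    st->s2 = TAUSWORTHE(st->s2,  2U, 27U, 4294967288U,  2U);
    st->s3 = TAUSWORTHE(st->s3, 13U, 21U, 4294967280U,  7U);
    st->s4 = TAUSWORTHE(st->s4,  3U, 12U, 4294967168U, 13U);
    return st->s1 ^ st->s2 ^ st->s3 ^ st->s4;
}
```

The kernel additionally requires minimum seed values for each component (the masks discard low bits), which a real seeding routine must enforce.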
(please note that regarding this point I'll propose a patch to replace that PRNG to stop directly exposing the internal state to the network).
I'm glad to hear that. A good option would be SFC32.
If you look closer at the patch, you'll see that in one interrupt the patch only uses any 32 out of the 128 bits of fast_pool to update only 32 bits of the net_rand_state. As such, the sequence observed on the network also depends on the remaining bits of net_rand_state, while the 96 other bits of the fast_pool are not exposed there.
The fast pool contains 128 bits of state, not 128 bits of entropy. The purpose of the large pool size is to make sure that the entropy is not lost due to collisions. The non-linear mixing function (a simplified version of a single round of the ChaCha mixing function so the entropy diffusion is low) means that the 32 bits leaked are not independent from the other 96 bits, and in fact you can reconstruct the entire pool from a single reading of 32 bits (as long as there aren't more than 32 bits of entropy added during this time -- which isn't the case, see below). Please think harder about this part. I think you are misunderstanding how this code works.
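The mixing function being discussed looks roughly like this, modeled on fast_mix() in drivers/char/random.c of this era (rotation constants from my reading of that file; treat as a sketch). Note that every step is an add, rotate, or xor, all of which are invertible, which is what makes Marc's point about reconstructing the pool from a partial leak plausible:

```c
#include <stdint.h>

static inline uint32_t rol32(uint32_t w, unsigned s)
{
    return (w << s) | (w >> (32 - s));
}

/* Two double-rounds of an ARX permutation over the 4 x 32-bit fast pool.
 * This diffuses entropy across the pool but is NOT a one-way function:
 * each step can be run backwards. */
static void fast_mix_sketch(uint32_t pool[4])
{
    uint32_t a = pool[0], b = pool[1], c = pool[2], d = pool[3];

    a += b;  c += d;
    b = rol32(b, 6);  d = rol32(d, 27);
    d ^= a;  b ^= c;

    a += b;  c += d;
    b = rol32(b, 16); d = rol32(d, 14);
    d ^= a;  b ^= c;

    a += b;  c += d;
    b = rol32(b, 6);  d = rol32(d, 27);
    d ^= a;  b ^= c;

    a += b;  c += d;
    b = rol32(b, 16); d = rol32(d, 14);
    d ^= a;  b ^= c;

    pool[0] = a; pool[1] = b; pool[2] = c; pool[3] = d;
}
```

One visible consequence of having no round constants: the all-zero pool is a fixed point of this permutation, another reason it only makes sense as a mixer for incoming noise, never as a generator.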
Since this is the same seed data being fed to get_random_bytes, it allows an attacker to determine the state and therefore the output of /dev/random. I sincerely hope that this was not the intended goal. :)
Not only was this obviously not the goal, but I'd be particularly interested in seeing this reality demonstrated, considering that the whole 128 bits of fast_pool together count as a single bit of entropy, and that as such, even if you were able to figure the value of the 32 bits leaked to net_rand_state, you'd still have to guess the 96 other bits for each single entropy bit :-/
The code assumes that there is at least 1/64 bit of entropy per sample, and at most 2 bits of entropy per sample (which is why it dumps 128 bits every 64 samples). You're extracting 32 bits every sample, which means it's leaking 2048 bits every 64 samples (to net_random; how fast that leaks to the outside world is a different issue). So the question is whether an attacker can reconstruct 128 bits of entropy from 2048 bits of internal state; this does not seem obviously impossible, since there are no cryptographically vetted operations involved.
The other thing this misses is that reseeding in dribs and drabs makes it much easier to recover the new state. This is explained well in the documentation about catastrophic reseeding in the Fortuna CPRNG. This entire approach is flawed: throwing in single bits of entropy at a time doesn't really help, since an attacker can brute force a single bit at a time. (There is a challenge in deriving the initial fast_pool state, which this makes more difficult, but there is no real cryptographic guarantee about how difficult it is.)
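The dribs-and-drabs argument can be made concrete with a toy model (entirely hypothetical, 8 bits instead of 128): a PRNG whose state fully leaks through its output, reseeded either one secret bit per step or all at once. With per-step feedback the attacker pays at most 2 guesses per bit; injected together, the same 8 bits cost up to 2^8 guesses. This is the work-factor gap Fortuna's catastrophic reseeding is designed to preserve.

```c
#include <stdint.h>

/* Attacker tracks a fully-leaked 8-bit state while one secret bit is
 * injected per step: at most 2 guesses per step, 16 total. */
static unsigned dribble_guesses(uint8_t secret)
{
    uint8_t state = 0, tracked = 0;
    unsigned guesses = 0;
    for (int i = 0; i < 8; i++) {
        state ^= (uint8_t)(((secret >> i) & 1) << i);  /* reseed one bit */
        for (int bit = 0; bit < 2; bit++) {            /* try 0, then 1 */
            guesses++;
            if ((uint8_t)(tracked ^ (bit << i)) == state) {
                tracked ^= (uint8_t)(bit << i);
                break;
            }
        }
    }
    return guesses;
}

/* All 8 secret bits injected at once: exhaustive search, up to 256 guesses. */
static unsigned batch_guesses(uint8_t secret)
{
    uint8_t state = secret;
    unsigned guesses = 0;
    for (unsigned g = 0; g < 256; g++) {
        guesses++;
        if ((uint8_t)g == state)
            break;
    }
    return guesses;
}
```

At 128 bits the same arithmetic is 256 guesses for dribbled reseeding versus 2^128 for a catastrophic one, which is why dribbling entropy into an observable generator buys almost nothing.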
Thanks,
Marc
On Wed, Aug 5, 2020 at 8:44 AM Marc Plumb lkml.mplumb@gmail.com wrote:
I thought net_rand_state was assumed to be insecure and that anyone could determine the internal state. Isn't this Working as Designed?
It was working as designed - because it wasn't really designed to be "real crypto" - but sadly it's also the only thing that is fast enough for a lot of networking.
So it may be _designed_ to be "not real crypto" and to have a discoverable internal state. But once again, reality interferes, and it turns out that people really want something very very fast that is also not deterministic enough to be discoverable at least remotely.
The stuff that is actually designed and intended to be a complete black box is sadly also much too slow. By about an order of magnitude.
Linus
On Wed, 5 Aug 2020 09:39:17 -0700 Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, Aug 5, 2020 at 8:44 AM Marc Plumb lkml.mplumb@gmail.com wrote:
I thought net_rand_state was assumed to be insecure and that anyone could determine the internal state. Isn't this Working as Designed?
It was working as designed - because it wasn't really designed to be "real crypto" - but sadly it's also the only thing that is fast enough for a lot of networking.
So it may be _designed_ to be "not real crypto" and to have a discoverable internal state. But once again, reality interferes, and it turns out that people really want something very very fast that is also not deterministic enough to be discoverable at least remotely.
If you turn on the wayback machine, the original net_random came out of needing OK values for network simulation with netem. In that use case, the point was just to have something with a good distribution and a reasonably long period.
Then the idea that TCP ports have to be random came along, and since that was both a weak security feature and benchmark-sensitive, net_random got drafted into more areas.