The initial Zinc patchset, after some mailing list discussion, contained code to ensure that the FPU would not be kept on (via kernel_fpu_begin()) for more than a PAGE_SIZE chunk, since keeping it on disables preemption. The choice of PAGE_SIZE isn't totally scientific, but it's not a bad guess either, and it's what's used in both the x86 poly1305 and blake2s library code already. Unfortunately it appears to have been left out of the final patchset that actually added the glue code. So, this commit adds back the PAGE_SIZE chunking.
Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Cc: Eric Biggers ebiggers@google.com
Cc: Ard Biesheuvel ardb@kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
---
Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's patches, or if there was actually some later discussion in which we concluded that the PAGE_SIZE chunking wasn't required, perhaps because of FPU changes. If that's the case, please do let me know, in which case I'll submit a _different_ patch that removes the chunking from x86 poly and blake. I can't find any emails that would indicate that, but I might be mistaken.
 arch/arm/crypto/chacha-glue.c        | 16 +++++++++++++---
 arch/arm/crypto/poly1305-glue.c      | 17 +++++++++++++----
 arch/arm64/crypto/chacha-neon-glue.c | 16 +++++++++++++---
 arch/arm64/crypto/poly1305-glue.c    | 17 +++++++++++++----
 arch/x86/crypto/chacha_glue.c        | 16 +++++++++++++---
 5 files changed, 65 insertions(+), 17 deletions(-)
diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c index 6fdb0ac62b3d..0e29ebac95fd 100644 --- a/arch/arm/crypto/chacha-glue.c +++ b/arch/arm/crypto/chacha-glue.c @@ -91,9 +91,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, return; }
- kernel_neon_begin(); - chacha_doneon(state, dst, src, bytes, nrounds); - kernel_neon_end(); + for (;;) { + unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes); + + kernel_neon_begin(); + chacha_doneon(state, dst, src, todo, nrounds); + kernel_neon_end(); + + bytes -= todo; + if (!bytes) + break; + src += todo; + dst += todo; + } } EXPORT_SYMBOL(chacha_crypt_arch);
diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c index ceec04ec2f40..536a4a943ebe 100644 --- a/arch/arm/crypto/poly1305-glue.c +++ b/arch/arm/crypto/poly1305-glue.c @@ -160,13 +160,22 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src, unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
if (static_branch_likely(&have_neon) && do_neon) { - kernel_neon_begin(); - poly1305_blocks_neon(&dctx->h, src, len, 1); - kernel_neon_end(); + for (;;) { + unsigned int todo = min_t(unsigned int, PAGE_SIZE, len); + + kernel_neon_begin(); + poly1305_blocks_neon(&dctx->h, src, todo, 1); + kernel_neon_end(); + + len -= todo; + if (!len) + break; + src += todo; + } } else { poly1305_blocks_arm(&dctx->h, src, len, 1); + src += len; } - src += len; nbytes %= POLY1305_BLOCK_SIZE; }
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c index 37ca3e889848..3eff767f4f77 100644 --- a/arch/arm64/crypto/chacha-neon-glue.c +++ b/arch/arm64/crypto/chacha-neon-glue.c @@ -87,9 +87,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, !crypto_simd_usable()) return chacha_crypt_generic(state, dst, src, bytes, nrounds);
- kernel_neon_begin(); - chacha_doneon(state, dst, src, bytes, nrounds); - kernel_neon_end(); + for (;;) { + unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes); + + kernel_neon_begin(); + chacha_doneon(state, dst, src, todo, nrounds); + kernel_neon_end(); + + bytes -= todo; + if (!bytes) + break; + src += todo; + dst += todo; + } } EXPORT_SYMBOL(chacha_crypt_arch);
diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c index e97b092f56b8..616134bef02c 100644 --- a/arch/arm64/crypto/poly1305-glue.c +++ b/arch/arm64/crypto/poly1305-glue.c @@ -143,13 +143,22 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src, unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
if (static_branch_likely(&have_neon) && crypto_simd_usable()) { - kernel_neon_begin(); - poly1305_blocks_neon(&dctx->h, src, len, 1); - kernel_neon_end(); + for (;;) { + unsigned int todo = min_t(unsigned int, PAGE_SIZE, len); + + kernel_neon_begin(); + poly1305_blocks_neon(&dctx->h, src, todo, 1); + kernel_neon_end(); + + len -= todo; + if (!len) + break; + src += todo; + } } else { poly1305_blocks(&dctx->h, src, len, 1); + src += len; } - src += len; nbytes %= POLY1305_BLOCK_SIZE; }
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c index b412c21ee06e..10733035b81c 100644 --- a/arch/x86/crypto/chacha_glue.c +++ b/arch/x86/crypto/chacha_glue.c @@ -153,9 +153,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, bytes <= CHACHA_BLOCK_SIZE) return chacha_crypt_generic(state, dst, src, bytes, nrounds);
- kernel_fpu_begin(); - chacha_dosimd(state, dst, src, bytes, nrounds); - kernel_fpu_end(); + for (;;) { + unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes); + + kernel_fpu_begin(); + chacha_dosimd(state, dst, src, todo, nrounds); + kernel_fpu_end(); + + bytes -= todo; + if (!bytes) + break; + src += todo; + dst += todo; + } } EXPORT_SYMBOL(chacha_crypt_arch);
From: Jason A. Donenfeld
Sent: 20 April 2020 08:57
The initial Zinc patchset, after some mailing list discussion, contained code to ensure that the FPU would not be kept on (via kernel_fpu_begin()) for more than a PAGE_SIZE chunk, since keeping it on disables preemption. The choice of PAGE_SIZE isn't totally scientific, but it's not a bad guess either, and it's what's used in both the x86 poly1305 and blake2s library code already. Unfortunately it appears to have been left out of the final patchset that actually added the glue code. So, this commit adds back the PAGE_SIZE chunking.
...
Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's patches, or if there was actually some later discussion in which we concluded that the PAGE_SIZE chunking wasn't required, perhaps because of FPU changes. If that's the case, please do let me know, in which case I'll submit a _different_ patch that removes the chunking from x86 poly and blake. I can't find any emails that would indicate that, but I might be mistaken.
Maybe kernel_fpu_begin() should be passed the address of a location where the address of an FPU save-area buffer can be written. Then the pre-emption code can allocate the buffer and save the state into it.
However, that doesn't solve the problem for non-preemptive kernels. They may need a cond_resched() in the loop if it might take 1ms (or so).
kernel_fpu_begin() ought also be passed a parameter saying which fpu features are required, and return which are allocated. On x86 this could be used to check for AVX512 (etc) which may be available in an ISR unless it interrupted inside a kernel_fpu_begin() section (etc). It would also allow optimisations if only 1 or 2 fpu registers are needed (eg for some of the crypto functions) rather than the whole fpu register set.
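To make that concrete, here is a rough sketch of what a mask-based interface along those lines could look like. Nothing below exists in the kernel today: the function name, the KFPU_* bits, and their semantics are all invented purely to illustrate the suggestion.

	/* Illustrative only -- not an existing kernel API. */
	#define KFPU_SSE	BIT(0)
	#define KFPU_AVX2	BIT(1)
	#define KFPU_AVX512	BIT(2)

	/*
	 * Ask for the features in @wanted and return the subset actually
	 * granted; e.g. AVX512 might be refused in an ISR that interrupted
	 * another kernel_fpu_begin() section.
	 */
	unsigned int kernel_fpu_begin_features(unsigned int wanted);

	static void crypt_example(u8 *dst, const u8 *src, unsigned int len)
	{
		unsigned int granted;

		granted = kernel_fpu_begin_features(KFPU_AVX512 | KFPU_AVX2 | KFPU_SSE);
		if (granted & KFPU_AVX512) {
			/* run the AVX512 body on dst/src/len */
		} else if (granted & KFPU_AVX2) {
			/* run the AVX2 body */
		} else {
			/* run the SSE or scalar body */
		}
		kernel_fpu_end();
	}

The point of the return value is that the caller can always fall back to a narrower implementation when the wider register set isn't available in the current context.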
David
On Mon, Apr 20, 2020 at 2:32 AM David Laight David.Laight@aculab.com wrote:
Maybe kernel_fpu_begin() should be passed the address of a location where the address of an FPU save-area buffer can be written. Then the pre-emption code can allocate the buffer and save the state into it.
However, that doesn't solve the problem for non-preemptive kernels. They may need a cond_resched() in the loop if it might take 1ms (or so).
kernel_fpu_begin() ought also be passed a parameter saying which fpu features are required, and return which are allocated. On x86 this could be used to check for AVX512 (etc) which may be available in an ISR unless it interrupted inside a kernel_fpu_begin() section (etc). It would also allow optimisations if only 1 or 2 fpu registers are needed (eg for some of the crypto functions) rather than the whole fpu register set.
There might be ways to improve lots of FPU things, indeed. This patch here is just a patch to Herbert's branch in order to make uniform usage of our existing solution for this, fixing the existing bug. I wouldn't mind seeing more involved and better solutions in a patchset for crypto-next.
Will follow up with your suggestion in a different thread, so as not to block this one.
Hi David,
On Mon, Apr 20, 2020 at 2:32 AM David Laight David.Laight@aculab.com wrote:
Maybe kernel_fpu_begin() should be passed the address of a location where the address of an FPU save-area buffer can be written. Then the pre-emption code can allocate the buffer and save the state into it.
Interesting idea. It looks like `struct xregs_state` is only 576 bytes. That's not exactly small, but it's not insanely huge either, and maybe we could justifiably stick that on the stack, or even reserve part of the stack allocation for it, at a spot the function would know about, without needing to specify any address.
kernel_fpu_begin() ought also be passed a parameter saying which fpu features are required, and return which are allocated. On x86 this could be used to check for AVX512 (etc) which may be available in an ISR unless it interrupted inside a kernel_fpu_begin() section (etc). It would also allow optimisations if only 1 or 2 fpu registers are needed (eg for some of the crypto functions) rather than the whole fpu register set.
For AVX512 this probably makes sense, I suppose. But I'm not sure if there are too many bits of crypto code that only use a few registers. There are those accelerated memcpy routines in i915 though -- ever see drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go this way, I wonder if it'd make sense to totally overengineer it and write a gcc/as plugin to create the register mask for us. Or, maybe some checker inside of objtool could help here.
Actually, though, the thing I've been wondering about is actually moving in the complete opposite direction: is there some efficient-enough way that we could allow FPU registers in all contexts always, without the need for kernel_fpu_begin/end? I was reversing ntoskrnl.exe and was kind of impressed (maybe not the right word?) by their judicious use of vectorisation everywhere. I assume a lot of that is being generated by their compiler, which of course gcc could do for us if we let it. Is that an interesting avenue to consider? Or are you pretty certain that it'd be a huge mistake, with an irreversible speed hit?
Jason
On Mon, Apr 20, 2020 at 10:14 PM Jason A. Donenfeld Jason@zx2c4.com wrote:
Hi David,
On Mon, Apr 20, 2020 at 2:32 AM David Laight David.Laight@aculab.com wrote:
Maybe kernel_fpu_begin() should be passed the address of a location where the address of an FPU save-area buffer can be written. Then the pre-emption code can allocate the buffer and save the state into it.
Interesting idea. It looks like `struct xregs_state` is only 576 bytes. That's not exactly small, but it's not insanely huge either, and maybe we could justifiably stick that on the stack, or even reserve part of the stack allocation for it, at a spot the function would know about, without needing to specify any address.
Hah-hah, nevermind here. extended_state_area is of course huge, bringing the whole structure to a whopping 3k with avx512. :)
On Tue, 21 Apr 2020 at 06:15, Jason A. Donenfeld Jason@zx2c4.com wrote:
Hi David,
On Mon, Apr 20, 2020 at 2:32 AM David Laight David.Laight@aculab.com wrote:
Maybe kernel_fpu_begin() should be passed the address of a location where the address of an FPU save-area buffer can be written. Then the pre-emption code can allocate the buffer and save the state into it.
Interesting idea. It looks like `struct xregs_state` is only 576 bytes. That's not exactly small, but it's not insanely huge either, and maybe we could justifiably stick that on the stack, or even reserve part of the stack allocation for it, at a spot the function would know about, without needing to specify any address.
kernel_fpu_begin() ought also be passed a parameter saying which fpu features are required, and return which are allocated. On x86 this could be used to check for AVX512 (etc) which may be available in an ISR unless it interrupted inside a kernel_fpu_begin() section (etc). It would also allow optimisations if only 1 or 2 fpu registers are needed (eg for some of the crypto functions) rather than the whole fpu register set.
For AVX512 this probably makes sense, I suppose. But I'm not sure if there are too many bits of crypto code that only use a few registers. There are those accelerated memcpy routines in i915 though -- ever see drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go this way, I wonder if it'd make sense to totally overengineer it and write a gcc/as plugin to create the register mask for us. Or, maybe some checker inside of objtool could help here.
Actually, though, the thing I've been wondering about is actually moving in the complete opposite direction: is there some efficient-enough way that we could allow FPU registers in all contexts always, without the need for kernel_fpu_begin/end? I was reversing ntoskrnl.exe and was kind of impressed (maybe not the right word?) by their judicious use of vectorisation everywhere. I assume a lot of that is being generated by their compiler, which of course gcc could do for us if we let it. Is that an interesting avenue to consider? Or are you pretty certain that it'd be a huge mistake, with an irreversible speed hit?
When I added support for kernel mode SIMD to arm64 originally, there was a kernel_neon_begin_partial() that took an int to specify how many registers were being used, the reason being that NEON preserve/restore was fully eager at this point, and arm64 has 32 SIMD registers, many of which weren't really used, e.g., in the basic implementation of AES based on special instructions.
With the introduction of lazy restore, and SVE handling for userspace, we decided to remove this because it didn't really help anymore, and made the code more difficult to manage.
However, I think it would make sense to have something like this in the general case. I.e., NEON registers 0-3 are always preserved when an exception or interrupt (or syscall) is taken, and so they can be used anywhere in the kernel. If you want the whole set, you will have to use begin/end as before. This would already unlock a few interesting cases, like memcpy, xor, and sequences that can easily be implemented with only a few registers, such as instruction-based AES.
Unfortunately, the compiler needs to be taught about this to be completely useful, which means lots of prototyping and benchmarking upfront, as the contract will be set in stone once the compilers get on board.
From: Ard Biesheuvel
Sent: 21 April 2020 08:02 On Tue, 21 Apr 2020 at 06:15, Jason A. Donenfeld Jason@zx2c4.com wrote:
Hi David,
On Mon, Apr 20, 2020 at 2:32 AM David Laight David.Laight@aculab.com wrote:
Maybe kernel_fpu_begin() should be passed the address of a location where the address of an FPU save-area buffer can be written. Then the pre-emption code can allocate the buffer and save the state into it.
Interesting idea. It looks like `struct xregs_state` is only 576 bytes. That's not exactly small, but it's not insanely huge either, and maybe we could justifiably stick that on the stack, or even reserve part of the stack allocation for it, at a spot the function would know about, without needing to specify any address.
kernel_fpu_begin() ought also be passed a parameter saying which fpu features are required, and return which are allocated. On x86 this could be used to check for AVX512 (etc) which may be available in an ISR unless it interrupted inside a kernel_fpu_begin() section (etc). It would also allow optimisations if only 1 or 2 fpu registers are needed (eg for some of the crypto functions) rather than the whole fpu register set.
For AVX512 this probably makes sense, I suppose. But I'm not sure if there are too many bits of crypto code that only use a few registers. There are those accelerated memcpy routines in i915 though -- ever see drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go this way, I wonder if it'd make sense to totally overengineer it and write a gcc/as plugin to create the register mask for us. Or, maybe some checker inside of objtool could help here.
Actually, though, the thing I've been wondering about is actually moving in the complete opposite direction: is there some efficient-enough way that we could allow FPU registers in all contexts always, without the need for kernel_fpu_begin/end? I was reversing ntoskrnl.exe and was kind of impressed (maybe not the right word?) by their judicious use of vectorisation everywhere. I assume a lot of that is being generated by their compiler, which of course gcc could do for us if we let it. Is that an interesting avenue to consider? Or are you pretty certain that it'd be a huge mistake, with an irreversible speed hit?
When I added support for kernel mode SIMD to arm64 originally, there was a kernel_neon_begin_partial() that took an int to specify how many registers were being used, the reason being that NEON preserve/restore was fully eager at this point, and arm64 has 32 SIMD registers, many of which weren't really used, e.g., in the basic implementation of AES based on special instructions.
With the introduction of lazy restore, and SVE handling for userspace, we decided to remove this because it didn't really help anymore, and made the code more difficult to manage.
However, I think it would make sense to have something like this in the general case. I.e., NEON registers 0-3 are always preserved when an exception or interrupt (or syscall) is taken, and so they can be used anywhere in the kernel. If you want the whole set, you will have to use begin/end as before. This would already unlock a few interesting cases, like memcpy, xor, and sequences that can easily be implemented with only a few registers, such as instruction-based AES.
Unfortunately, the compiler needs to be taught about this to be completely useful, which means lots of prototyping and benchmarking upfront, as the contract will be set in stone once the compilers get on board.
You can always just use asm with explicit registers.
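For what it's worth, "asm with explicit registers" on arm64 might look something like the sketch below. It assumes, purely hypothetically per the proposal above, that v0/v1 may be used without kernel_neon_begin()/kernel_neon_end(); the current kernel makes no such guarantee, so this is an illustration of the idea, not working code for today's tree.

	static void xor_block_16(u8 *dst, const u8 *a, const u8 *b)
	{
		/*
		 * Only v0 and v1 are touched, and they are named explicitly,
		 * so the compiler cannot allocate other SIMD registers here.
		 */
		asm volatile("ld1	{v0.16b}, [%1]\n\t"
			     "ld1	{v1.16b}, [%2]\n\t"
			     "eor	v0.16b, v0.16b, v1.16b\n\t"
			     "st1	{v0.16b}, [%0]"
			     :
			     : "r"(dst), "r"(a), "r"(b)
			     : "v0", "v1", "memory");
	}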
David
From: Jason A. Donenfeld
Sent: 21 April 2020 05:15
Hi David,
On Mon, Apr 20, 2020 at 2:32 AM David Laight David.Laight@aculab.com wrote:
Maybe kernel_fpu_begin() should be passed the address of a location where the address of an FPU save-area buffer can be written. Then the pre-emption code can allocate the buffer and save the state into it.
Interesting idea. It looks like `struct xregs_state` is only 576 bytes. That's not exactly small, but it's not insanely huge either, and maybe we could justifiably stick that on the stack, or even reserve part of the stack allocation for it, at a spot the function would know about, without needing to specify any address.
As you said yourself, with AVX512 it is much larger, which is why I suggested the save code could allocate the area. Note that this would only be needed for nested use (for a full save).
kernel_fpu_begin() ought also be passed a parameter saying which fpu features are required, and return which are allocated. On x86 this could be used to check for AVX512 (etc) which may be available in an ISR unless it interrupted inside a kernel_fpu_begin() section (etc). It would also allow optimisations if only 1 or 2 fpu registers are needed (eg for some of the crypto functions) rather than the whole fpu register set.
For AVX512 this probably makes sense, I suppose. But I'm not sure if there are too many bits of crypto code that only use a few registers. There are those accelerated memcpy routines in i915 though -- ever see drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go this way, I wonder if it'd make sense to totally overengineer it and write a gcc/as plugin to create the register mask for us. Or, maybe some checker inside of objtool could help here.
I suspect some of that code is overly unrolled.
Actually, though, the thing I've been wondering about is actually moving in the complete opposite direction: is there some efficient-enough way that we could allow FPU registers in all contexts always, without the need for kernel_fpu_begin/end? I was reversing ntoskrnl.exe and was kind of impressed (maybe not the right word?) by their judicious use of vectorisation everywhere. I assume a lot of that is being generated by their compiler, which of course gcc could do for us if we let it. Is that an interesting avenue to consider? Or are you pretty certain that it'd be a huge mistake, with an irreversible speed hit?
I think Windows takes the 'hit' of saving the entire FPU state on every kernel entry. Note that for system calls this is actually minimal: all the 'callee saved' registers (most of the FPU ones) can be trashed, i.e. reloaded with zeros.
David
On Mon, Apr 20, 2020 at 01:57:11AM -0600, Jason A. Donenfeld wrote:
The initial Zinc patchset, after some mailing list discussion, contained code to ensure that the FPU would not be kept on (via kernel_fpu_begin()) for more than a PAGE_SIZE chunk, since keeping it on disables preemption. The choice of PAGE_SIZE isn't totally scientific, but it's not a bad guess either, and it's what's used in both the x86 poly1305 and blake2s library code already. Unfortunately it appears to have been left out of the final patchset that actually added the glue code. So, this commit adds back the PAGE_SIZE chunking.
Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function") Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function") Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function") Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation") Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation") Cc: Eric Biggers ebiggers@google.com Cc: Ard Biesheuvel ardb@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's patches, or if there was actually some later discussion in which we concluded that the PAGE_SIZE chunking wasn't required, perhaps because of FPU changes. If that's the case, please do let me know, in which case I'll submit a _different_ patch that removes the chunking from x86 poly and blake. I can't find any emails that would indicate that, but I might be mistaken.
arch/arm/crypto/chacha-glue.c | 16 +++++++++++++--- arch/arm/crypto/poly1305-glue.c | 17 +++++++++++++---- arch/arm64/crypto/chacha-neon-glue.c | 16 +++++++++++++--- arch/arm64/crypto/poly1305-glue.c | 17 +++++++++++++---- arch/x86/crypto/chacha_glue.c | 16 +++++++++++++--- 5 files changed, 65 insertions(+), 17 deletions(-)
I don't think you're missing anything. On x86, kernel_fpu_begin() and kernel_fpu_end() did get optimized in v5.2. But they still disable preemption, which is the concern here.
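For context, a heavily simplified sketch of the x86 behaviour being discussed; the real implementations in arch/x86/kernel/fpu/core.c do considerably more (state tracking, lazy restore, etc.), but the preemption bracket is the part that matters for the chunking question:

	/* Simplified sketch only -- not the actual arch/x86 code. */
	void kernel_fpu_begin(void)
	{
		preempt_disable();
		/* ... save the task's live FPU register state if necessary ... */
	}

	void kernel_fpu_end(void)
	{
		/* ... arrange for the task's FPU state to be restored later ... */
		preempt_enable();
	}

Because preemption stays off for the entire begin/end section, the length of data processed inside one section bounds the scheduling latency.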
diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c index 6fdb0ac62b3d..0e29ebac95fd 100644 --- a/arch/arm/crypto/chacha-glue.c +++ b/arch/arm/crypto/chacha-glue.c @@ -91,9 +91,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, return; }
-	kernel_neon_begin();
-	chacha_doneon(state, dst, src, bytes, nrounds);
-	kernel_neon_end();
+	for (;;) {
+		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
+
+		kernel_neon_begin();
+		chacha_doneon(state, dst, src, todo, nrounds);
+		kernel_neon_end();
+
+		bytes -= todo;
+		if (!bytes)
+			break;
+		src += todo;
+		dst += todo;
+	}
 }
 EXPORT_SYMBOL(chacha_crypt_arch);
Seems this should just be a 'while' loop?
	while (bytes) {
		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);

		kernel_neon_begin();
		chacha_doneon(state, dst, src, todo, nrounds);
		kernel_neon_end();

		bytes -= todo;
		src += todo;
		dst += todo;
	}
Likewise elsewhere in this patch. (For Poly1305, len >= POLY1305_BLOCK_SIZE at the beginning, so that could use a 'do' loop.)
- Eric
On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers ebiggers@kernel.org wrote:
Seems this should just be a 'while' loop?
	while (bytes) {
		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);

		kernel_neon_begin();
		chacha_doneon(state, dst, src, todo, nrounds);
		kernel_neon_end();

		bytes -= todo;
		src += todo;
		dst += todo;
	}
The for(;;) is how it's done elsewhere in the kernel (that this patch doesn't touch), because then we can break out of the loop before having to increment src and dst unnecessarily. Likely a pointless optimization as probably the compiler can figure out how to avoid that. But maybe it can't. If you have a strong preference, I can refactor everything to use `while (bytes)`, but if you don't care, let's keep this as-is. Opinion?
Jason
The initial Zinc patchset, after some mailing list discussion, contained code to ensure that the FPU would not be kept on (via kernel_fpu_begin()) for more than a 4k chunk, since keeping it on disables preemption. The choice of 4k isn't totally scientific, but it's not a bad guess either, and it's what's used in the x86 poly1305, blake2s, and nhpoly1305 code already (in the form of PAGE_SIZE, which this commit corrects to be explicitly 4k).
Ard did some back of the envelope calculations and found that at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k means we have a maximum preemption disabling of 20us, which Sebastian confirmed was probably a good limit.
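Spelled out, the back-of-the-envelope bound from the figures above is:

	4096 bytes * 5 cycles/byte = 20480 cycles
	20480 cycles / 10^9 cycles/second ~= 20.5 microseconds

of preemption-disabled time per kernel_fpu_begin()/kernel_fpu_end() section.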
Unfortunately the chunking appears to have been left out of the final patchset that added the glue code. So, this commit adds it back in.
Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel")
Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305")
Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305")
Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305")
Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305")
Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
Cc: Eric Biggers ebiggers@google.com
Cc: Ard Biesheuvel ardb@kernel.org
Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
---
Changes v1->v2:
- [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
- [Eric] Prefer do-while over for (;;).
 arch/arm/crypto/chacha-glue.c            | 14 +++++++++++---
 arch/arm/crypto/nhpoly1305-neon-glue.c   |  2 +-
 arch/arm/crypto/poly1305-glue.c          | 15 +++++++++++----
 arch/arm64/crypto/chacha-neon-glue.c     | 14 +++++++++++---
 arch/arm64/crypto/nhpoly1305-neon-glue.c |  2 +-
 arch/arm64/crypto/poly1305-glue.c        | 15 +++++++++++----
 arch/x86/crypto/blake2s-glue.c           | 10 ++++------
 arch/x86/crypto/chacha_glue.c            | 14 +++++++++++---
 arch/x86/crypto/nhpoly1305-avx2-glue.c   |  2 +-
 arch/x86/crypto/nhpoly1305-sse2-glue.c   |  2 +-
 arch/x86/crypto/poly1305_glue.c          | 13 ++++++-------
 11 files changed, 69 insertions(+), 34 deletions(-)
diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c index 6fdb0ac62b3d..59da6c0b63b6 100644 --- a/arch/arm/crypto/chacha-glue.c +++ b/arch/arm/crypto/chacha-glue.c @@ -91,9 +91,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, return; }
- kernel_neon_begin(); - chacha_doneon(state, dst, src, bytes, nrounds); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, bytes, SZ_4K); + + kernel_neon_begin(); + chacha_doneon(state, dst, src, todo, nrounds); + kernel_neon_end(); + + bytes -= todo; + src += todo; + dst += todo; + } while (bytes); } EXPORT_SYMBOL(chacha_crypt_arch);
diff --git a/arch/arm/crypto/nhpoly1305-neon-glue.c b/arch/arm/crypto/nhpoly1305-neon-glue.c index ae5aefc44a4d..ffa8d73fe722 100644 --- a/arch/arm/crypto/nhpoly1305-neon-glue.c +++ b/arch/arm/crypto/nhpoly1305-neon-glue.c @@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct shash_desc *desc, return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_neon_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon); diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c index ceec04ec2f40..13cfef4ae22e 100644 --- a/arch/arm/crypto/poly1305-glue.c +++ b/arch/arm/crypto/poly1305-glue.c @@ -160,13 +160,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src, unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
if (static_branch_likely(&have_neon) && do_neon) { - kernel_neon_begin(); - poly1305_blocks_neon(&dctx->h, src, len, 1); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, len, SZ_4K); + + kernel_neon_begin(); + poly1305_blocks_neon(&dctx->h, src, todo, 1); + kernel_neon_end(); + + len -= todo; + src += todo; + } while (len); } else { poly1305_blocks_arm(&dctx->h, src, len, 1); + src += len; } - src += len; nbytes %= POLY1305_BLOCK_SIZE; }
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c index 37ca3e889848..af2bbca38e70 100644 --- a/arch/arm64/crypto/chacha-neon-glue.c +++ b/arch/arm64/crypto/chacha-neon-glue.c @@ -87,9 +87,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, !crypto_simd_usable()) return chacha_crypt_generic(state, dst, src, bytes, nrounds);
- kernel_neon_begin(); - chacha_doneon(state, dst, src, bytes, nrounds); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, bytes, SZ_4K); + + kernel_neon_begin(); + chacha_doneon(state, dst, src, todo, nrounds); + kernel_neon_end(); + + bytes -= todo; + src += todo; + dst += todo; + } while (bytes); } EXPORT_SYMBOL(chacha_crypt_arch);
diff --git a/arch/arm64/crypto/nhpoly1305-neon-glue.c b/arch/arm64/crypto/nhpoly1305-neon-glue.c index 895d3727c1fb..c5405e6a6db7 100644 --- a/arch/arm64/crypto/nhpoly1305-neon-glue.c +++ b/arch/arm64/crypto/nhpoly1305-neon-glue.c @@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct shash_desc *desc, return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_neon_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon); diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c index e97b092f56b8..f33ada70c4ed 100644 --- a/arch/arm64/crypto/poly1305-glue.c +++ b/arch/arm64/crypto/poly1305-glue.c @@ -143,13 +143,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src, unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
if (static_branch_likely(&have_neon) && crypto_simd_usable()) { - kernel_neon_begin(); - poly1305_blocks_neon(&dctx->h, src, len, 1); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, len, SZ_4K); + + kernel_neon_begin(); + poly1305_blocks_neon(&dctx->h, src, todo, 1); + kernel_neon_end(); + + len -= todo; + src += todo; + } while (len); } else { poly1305_blocks(&dctx->h, src, len, 1); + src += len; } - src += len; nbytes %= POLY1305_BLOCK_SIZE; }
diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c index 06ef2d4a4701..6737bcea1fa1 100644 --- a/arch/x86/crypto/blake2s-glue.c +++ b/arch/x86/crypto/blake2s-glue.c @@ -32,16 +32,16 @@ void blake2s_compress_arch(struct blake2s_state *state, const u32 inc) { /* SIMD disables preemption, so relax after processing each page. */ - BUILD_BUG_ON(PAGE_SIZE / BLAKE2S_BLOCK_SIZE < 8); + BUILD_BUG_ON(SZ_4K / BLAKE2S_BLOCK_SIZE < 8);
if (!static_branch_likely(&blake2s_use_ssse3) || !crypto_simd_usable()) { blake2s_compress_generic(state, block, nblocks, inc); return; }
- for (;;) { + do { const size_t blocks = min_t(size_t, nblocks, - PAGE_SIZE / BLAKE2S_BLOCK_SIZE); + SZ_4K / BLAKE2S_BLOCK_SIZE);
kernel_fpu_begin(); if (IS_ENABLED(CONFIG_AS_AVX512) && @@ -52,10 +52,8 @@ void blake2s_compress_arch(struct blake2s_state *state, kernel_fpu_end();
nblocks -= blocks; - if (!nblocks) - break; block += blocks * BLAKE2S_BLOCK_SIZE; - } + } while (nblocks); } EXPORT_SYMBOL(blake2s_compress_arch);
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c index b412c21ee06e..22250091cdbe 100644 --- a/arch/x86/crypto/chacha_glue.c +++ b/arch/x86/crypto/chacha_glue.c @@ -153,9 +153,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, bytes <= CHACHA_BLOCK_SIZE) return chacha_crypt_generic(state, dst, src, bytes, nrounds);
- kernel_fpu_begin(); - chacha_dosimd(state, dst, src, bytes, nrounds); - kernel_fpu_end(); + do { + unsigned int todo = min_t(unsigned int, bytes, SZ_4K); + + kernel_fpu_begin(); + chacha_dosimd(state, dst, src, todo, nrounds); + kernel_fpu_end(); + + bytes -= todo; + src += todo; + dst += todo; + } while (bytes); } EXPORT_SYMBOL(chacha_crypt_arch);
diff --git a/arch/x86/crypto/nhpoly1305-avx2-glue.c b/arch/x86/crypto/nhpoly1305-avx2-glue.c index f7567cbd35b6..80fcb85736e1 100644 --- a/arch/x86/crypto/nhpoly1305-avx2-glue.c +++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c @@ -29,7 +29,7 @@ static int nhpoly1305_avx2_update(struct shash_desc *desc, return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_fpu_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2); diff --git a/arch/x86/crypto/nhpoly1305-sse2-glue.c b/arch/x86/crypto/nhpoly1305-sse2-glue.c index a661ede3b5cf..cc6b7c1a2705 100644 --- a/arch/x86/crypto/nhpoly1305-sse2-glue.c +++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c @@ -29,7 +29,7 @@ static int nhpoly1305_sse2_update(struct shash_desc *desc, return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_fpu_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_sse2); diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c index 6dfec19f7d57..dfe921efa9b2 100644 --- a/arch/x86/crypto/poly1305_glue.c +++ b/arch/x86/crypto/poly1305_glue.c @@ -91,8 +91,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len, struct poly1305_arch_internal *state = ctx;
/* SIMD disables preemption, so relax after processing each page. */ - BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE || - PAGE_SIZE % POLY1305_BLOCK_SIZE); + BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE || + SZ_4K % POLY1305_BLOCK_SIZE);
if (!static_branch_likely(&poly1305_use_avx) || (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) || @@ -102,8 +102,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len, return; }
- for (;;) { - const size_t bytes = min_t(size_t, len, PAGE_SIZE); + do { + const size_t bytes = min_t(size_t, len, SZ_4K);
kernel_fpu_begin(); if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512)) @@ -113,11 +113,10 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len, else poly1305_blocks_avx(ctx, inp, bytes, padbit); kernel_fpu_end(); + len -= bytes; - if (!len) - break; inp += bytes; - } + } while (len); }
static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
On Wed, Apr 22, 2020 at 02:03:44PM -0600, Jason A. Donenfeld wrote:
The initial Zinc patchset, after some mailing list discussion, contained code to ensure that the FPU would not be kept on (via kernel_fpu_begin()) for more than a 4k chunk, since keeping it on disables preemption. The choice of 4k isn't totally scientific, but it's not a bad guess either, and it's what's used in the x86 poly1305, blake2s, and nhpoly1305 code already (in the form of PAGE_SIZE, which this commit corrects to be explicitly 4k).
Ard did some back of the envelope calculations and found that at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k means we have a maximum preemption disabling of 20us, which Sebastian confirmed was probably a good limit.
Unfortunately the chunking appears to have been left out of the final patchset that added the glue code. So, this commit adds it back in.
Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function") Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function") Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function") Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel") Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation") Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation") Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305") Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305") Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305") Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305") Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation") Cc: Eric Biggers ebiggers@google.com Cc: Ard Biesheuvel ardb@kernel.org Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
Changes v1->v2:
- [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
- [Eric] Prefer do-while over for (;;).
arch/arm/crypto/chacha-glue.c | 14 +++++++++++--- arch/arm/crypto/nhpoly1305-neon-glue.c | 2 +- arch/arm/crypto/poly1305-glue.c | 15 +++++++++++---- arch/arm64/crypto/chacha-neon-glue.c | 14 +++++++++++--- arch/arm64/crypto/nhpoly1305-neon-glue.c | 2 +- arch/arm64/crypto/poly1305-glue.c | 15 +++++++++++---- arch/x86/crypto/blake2s-glue.c | 10 ++++------ arch/x86/crypto/chacha_glue.c | 14 +++++++++++--- arch/x86/crypto/nhpoly1305-avx2-glue.c | 2 +- arch/x86/crypto/nhpoly1305-sse2-glue.c | 2 +- arch/x86/crypto/poly1305_glue.c | 13 ++++++------- 11 files changed, 69 insertions(+), 34 deletions(-)
Can you split the nhpoly1305 changes into a separate patch? They're a bit different from the rest of this patch, which is fixing up the crypto library interface that's new in v5.5. The nhpoly1305 changes apply to v5.0 and don't have anything to do with the crypto library interface, and they're also a bit different since they replace PAGE_SIZE with 4K rather than unlimited with 4K.
- Eric
On Wed, Apr 22, 2020 at 4:39 PM Eric Biggers ebiggers@kernel.org wrote:
Can you split the nhpoly1305 changes into a separate patch? They're a bit different from the rest of this patch, which is fixing up the crypto library interface that's new in v5.5. The nhpoly1305 changes apply to v5.0 and don't have anything to do with the crypto library interface, and they're also a bit different since they replace PAGE_SIZE with 4K rather than unlimited with 4K.
Good point about the nhpoly change not being part of the library interface and thus backportable differently. I'll split that out and send a v3 shortly.
Jason
The initial Zinc patchset, after some mailing list discussion, contained code to ensure that the FPU would not be kept on (via kernel_fpu_begin()) for more than a 4k chunk, since keeping it on disables preemption. The choice of 4k isn't totally scientific, but it's not a bad guess either, and it's what's used in the x86 poly1305, blake2s, and nhpoly1305 code already (in the form of PAGE_SIZE, which this commit corrects to be explicitly 4k for the former two).
Ard did some back of the envelope calculations and found that at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k means we have a maximum preemption disabling of 20us, which Sebastian confirmed was probably a good limit.
Unfortunately the chunking appears to have been left out of the final patchset that added the glue code. So, this commit adds it back in.
Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel")
Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
Cc: Eric Biggers ebiggers@google.com
Cc: Ard Biesheuvel ardb@kernel.org
Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
---
Changes v2->v3:
- [Eric] Split nhpoly1305 changes into separate commit, since it's not related to the library interface.

Changes v1->v2:
- [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
- [Eric] Prefer do-while over for (;;).
 arch/arm/crypto/chacha-glue.c        | 14 +++++++++++---
 arch/arm/crypto/poly1305-glue.c      | 15 +++++++++++----
 arch/arm64/crypto/chacha-neon-glue.c | 14 +++++++++++---
 arch/arm64/crypto/poly1305-glue.c    | 15 +++++++++++----
 arch/x86/crypto/blake2s-glue.c       | 10 ++++------
 arch/x86/crypto/chacha_glue.c        | 14 +++++++++++---
 arch/x86/crypto/poly1305_glue.c      | 13 ++++++-------
 7 files changed, 65 insertions(+), 30 deletions(-)
diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c index 6fdb0ac62b3d..59da6c0b63b6 100644 --- a/arch/arm/crypto/chacha-glue.c +++ b/arch/arm/crypto/chacha-glue.c @@ -91,9 +91,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, return; }
- kernel_neon_begin(); - chacha_doneon(state, dst, src, bytes, nrounds); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, bytes, SZ_4K); + + kernel_neon_begin(); + chacha_doneon(state, dst, src, todo, nrounds); + kernel_neon_end(); + + bytes -= todo; + src += todo; + dst += todo; + } while (bytes); } EXPORT_SYMBOL(chacha_crypt_arch);
diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c index ceec04ec2f40..13cfef4ae22e 100644 --- a/arch/arm/crypto/poly1305-glue.c +++ b/arch/arm/crypto/poly1305-glue.c @@ -160,13 +160,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src, unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
if (static_branch_likely(&have_neon) && do_neon) { - kernel_neon_begin(); - poly1305_blocks_neon(&dctx->h, src, len, 1); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, len, SZ_4K); + + kernel_neon_begin(); + poly1305_blocks_neon(&dctx->h, src, todo, 1); + kernel_neon_end(); + + len -= todo; + src += todo; + } while (len); } else { poly1305_blocks_arm(&dctx->h, src, len, 1); + src += len; } - src += len; nbytes %= POLY1305_BLOCK_SIZE; }
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c index 37ca3e889848..af2bbca38e70 100644 --- a/arch/arm64/crypto/chacha-neon-glue.c +++ b/arch/arm64/crypto/chacha-neon-glue.c @@ -87,9 +87,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, !crypto_simd_usable()) return chacha_crypt_generic(state, dst, src, bytes, nrounds);
- kernel_neon_begin(); - chacha_doneon(state, dst, src, bytes, nrounds); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, bytes, SZ_4K); + + kernel_neon_begin(); + chacha_doneon(state, dst, src, todo, nrounds); + kernel_neon_end(); + + bytes -= todo; + src += todo; + dst += todo; + } while (bytes); } EXPORT_SYMBOL(chacha_crypt_arch);
diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c index e97b092f56b8..f33ada70c4ed 100644 --- a/arch/arm64/crypto/poly1305-glue.c +++ b/arch/arm64/crypto/poly1305-glue.c @@ -143,13 +143,20 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src, unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);
if (static_branch_likely(&have_neon) && crypto_simd_usable()) { - kernel_neon_begin(); - poly1305_blocks_neon(&dctx->h, src, len, 1); - kernel_neon_end(); + do { + unsigned int todo = min_t(unsigned int, len, SZ_4K); + + kernel_neon_begin(); + poly1305_blocks_neon(&dctx->h, src, todo, 1); + kernel_neon_end(); + + len -= todo; + src += todo; + } while (len); } else { poly1305_blocks(&dctx->h, src, len, 1); + src += len; } - src += len; nbytes %= POLY1305_BLOCK_SIZE; }
diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c index 06ef2d4a4701..6737bcea1fa1 100644 --- a/arch/x86/crypto/blake2s-glue.c +++ b/arch/x86/crypto/blake2s-glue.c @@ -32,16 +32,16 @@ void blake2s_compress_arch(struct blake2s_state *state, const u32 inc) { /* SIMD disables preemption, so relax after processing each page. */ - BUILD_BUG_ON(PAGE_SIZE / BLAKE2S_BLOCK_SIZE < 8); + BUILD_BUG_ON(SZ_4K / BLAKE2S_BLOCK_SIZE < 8);
if (!static_branch_likely(&blake2s_use_ssse3) || !crypto_simd_usable()) { blake2s_compress_generic(state, block, nblocks, inc); return; }
- for (;;) { + do { const size_t blocks = min_t(size_t, nblocks, - PAGE_SIZE / BLAKE2S_BLOCK_SIZE); + SZ_4K / BLAKE2S_BLOCK_SIZE);
kernel_fpu_begin(); if (IS_ENABLED(CONFIG_AS_AVX512) && @@ -52,10 +52,8 @@ void blake2s_compress_arch(struct blake2s_state *state, kernel_fpu_end();
nblocks -= blocks; - if (!nblocks) - break; block += blocks * BLAKE2S_BLOCK_SIZE; - } + } while (nblocks); } EXPORT_SYMBOL(blake2s_compress_arch);
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c index b412c21ee06e..22250091cdbe 100644 --- a/arch/x86/crypto/chacha_glue.c +++ b/arch/x86/crypto/chacha_glue.c @@ -153,9 +153,17 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes, bytes <= CHACHA_BLOCK_SIZE) return chacha_crypt_generic(state, dst, src, bytes, nrounds);
- kernel_fpu_begin(); - chacha_dosimd(state, dst, src, bytes, nrounds); - kernel_fpu_end(); + do { + unsigned int todo = min_t(unsigned int, bytes, SZ_4K); + + kernel_fpu_begin(); + chacha_dosimd(state, dst, src, todo, nrounds); + kernel_fpu_end(); + + bytes -= todo; + src += todo; + dst += todo; + } while (bytes); } EXPORT_SYMBOL(chacha_crypt_arch);
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c index 6dfec19f7d57..dfe921efa9b2 100644 --- a/arch/x86/crypto/poly1305_glue.c +++ b/arch/x86/crypto/poly1305_glue.c @@ -91,8 +91,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len, struct poly1305_arch_internal *state = ctx;
/* SIMD disables preemption, so relax after processing each page. */ - BUILD_BUG_ON(PAGE_SIZE < POLY1305_BLOCK_SIZE || - PAGE_SIZE % POLY1305_BLOCK_SIZE); + BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE || + SZ_4K % POLY1305_BLOCK_SIZE);
if (!static_branch_likely(&poly1305_use_avx) || (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) || @@ -102,8 +102,8 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len, return; }
- for (;;) { - const size_t bytes = min_t(size_t, len, PAGE_SIZE); + do { + const size_t bytes = min_t(size_t, len, SZ_4K);
kernel_fpu_begin(); if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512)) @@ -113,11 +113,10 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len, else poly1305_blocks_avx(ctx, inp, bytes, padbit); kernel_fpu_end(); + len -= bytes; - if (!len) - break; inp += bytes; - } + } while (len); }
static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
Rather than chunking via PAGE_SIZE, this commit changes the arch implementations to chunk in explicit 4k parts, so that calculations on maximum acceptable latency don't suddenly become invalid on platforms where PAGE_SIZE isn't 4k, such as arm64.
Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305")
Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305")
Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305")
Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305")
Cc: Eric Biggers ebiggers@google.com
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
---
 arch/arm/crypto/nhpoly1305-neon-glue.c   | 2 +-
 arch/arm64/crypto/nhpoly1305-neon-glue.c | 2 +-
 arch/x86/crypto/nhpoly1305-avx2-glue.c   | 2 +-
 arch/x86/crypto/nhpoly1305-sse2-glue.c   | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/arm/crypto/nhpoly1305-neon-glue.c b/arch/arm/crypto/nhpoly1305-neon-glue.c index ae5aefc44a4d..ffa8d73fe722 100644 --- a/arch/arm/crypto/nhpoly1305-neon-glue.c +++ b/arch/arm/crypto/nhpoly1305-neon-glue.c @@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct shash_desc *desc, return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_neon_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon); diff --git a/arch/arm64/crypto/nhpoly1305-neon-glue.c b/arch/arm64/crypto/nhpoly1305-neon-glue.c index 895d3727c1fb..c5405e6a6db7 100644 --- a/arch/arm64/crypto/nhpoly1305-neon-glue.c +++ b/arch/arm64/crypto/nhpoly1305-neon-glue.c @@ -30,7 +30,7 @@ static int nhpoly1305_neon_update(struct shash_desc *desc, return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_neon_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_neon); diff --git a/arch/x86/crypto/nhpoly1305-avx2-glue.c b/arch/x86/crypto/nhpoly1305-avx2-glue.c index f7567cbd35b6..80fcb85736e1 100644 --- a/arch/x86/crypto/nhpoly1305-avx2-glue.c +++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c @@ -29,7 +29,7 @@ static int nhpoly1305_avx2_update(struct shash_desc *desc, return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_fpu_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2); diff --git a/arch/x86/crypto/nhpoly1305-sse2-glue.c b/arch/x86/crypto/nhpoly1305-sse2-glue.c index a661ede3b5cf..cc6b7c1a2705 100644 --- a/arch/x86/crypto/nhpoly1305-sse2-glue.c +++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c @@ -29,7 +29,7 @@ static int nhpoly1305_sse2_update(struct shash_desc *desc, return crypto_nhpoly1305_update(desc, src, srclen);
do { - unsigned int n = min_t(unsigned int, srclen, PAGE_SIZE); + unsigned int n = min_t(unsigned int, srclen, SZ_4K);
kernel_fpu_begin(); crypto_nhpoly1305_update_helper(desc, src, n, _nh_sse2);
On Wed, Apr 22, 2020 at 05:18:54PM -0600, Jason A. Donenfeld wrote:
Rather than chunking via PAGE_SIZE, this commit changes the arch implementations to chunk in explicit 4k parts, so that calculations on maximum acceptable latency don't suddenly become invalid on platforms where PAGE_SIZE isn't 4k, such as arm64.
Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305") Fixes: 012c82388c03 ("crypto: x86/nhpoly1305 - add SSE2 accelerated NHPoly1305") Fixes: a00fa0c88774 ("crypto: arm64/nhpoly1305 - add NEON-accelerated NHPoly1305") Fixes: 16aae3595a9d ("crypto: arm/nhpoly1305 - add NEON-accelerated NHPoly1305") Cc: Eric Biggers ebiggers@google.com Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
arm64 normally uses PAGE_SIZE == 4k, so this commit message is a little misleading. Anyway, I agree with using 4k, so:
Reviewed-by: Eric Biggers ebiggers@google.com
- Eric
On Wed, Apr 22, 2020 at 05:18:53PM -0600, Jason A. Donenfeld wrote:
The initial Zinc patchset, after some mailing list discussion, contained code to ensure that the FPU would not be kept on (via kernel_fpu_begin()) for more than a 4k chunk, since keeping it on disables preemption. The choice of 4k isn't totally scientific, but it's not a bad guess either, and it's what's used in the x86 poly1305, blake2s, and nhpoly1305 code already (in the form of PAGE_SIZE, which this commit corrects to be explicitly 4k for the former two).
Ard did some back of the envelope calculations and found that at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k means we have a maximum preemption disabling of 20us, which Sebastian confirmed was probably a good limit.
Unfortunately the chunking appears to have been left out of the final patchset that added the glue code. So, this commit adds it back in.
Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function") Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function") Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function") Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel") Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation") Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation") Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation") Cc: Eric Biggers ebiggers@google.com Cc: Ard Biesheuvel ardb@kernel.org Cc: Sebastian Andrzej Siewior bigeasy@linutronix.de Cc: stable@vger.kernel.org Signed-off-by: Jason A. Donenfeld Jason@zx2c4.com
Changes v2->v3:
- [Eric] Split nhpoly1305 changes into separate commit, since it's not related to the library interface.
Changes v1->v2:
- [Ard] Use explicit 4k chunks instead of PAGE_SIZE.
- [Eric] Prefer do-while over for (;;).
arch/arm/crypto/chacha-glue.c | 14 +++++++++++--- arch/arm/crypto/poly1305-glue.c | 15 +++++++++++---- arch/arm64/crypto/chacha-neon-glue.c | 14 +++++++++++--- arch/arm64/crypto/poly1305-glue.c | 15 +++++++++++---- arch/x86/crypto/blake2s-glue.c | 10 ++++------ arch/x86/crypto/chacha_glue.c | 14 +++++++++++--- arch/x86/crypto/poly1305_glue.c | 13 ++++++------- 7 files changed, 65 insertions(+), 30 deletions(-)
All applied. Thanks.