Re: [PATCH] crypto: qat - lower priority for skcipher and aead algorithms

16 Jun 2025


On Mon, Jun 16, 2025 at 04:02:30PM +0100, Giovanni Cabiddu wrote:
...
On Mon, Jun 16, 2025 at 12:18:02PM +0800, Herbert Xu wrote:
...
On Fri, Jun 13, 2025 at 11:32:27AM +0100, Giovanni Cabiddu wrote:
...
Most kernel applications utilizing the crypto API operate synchronously
and on small buffer sizes, and therefore do not benefit from QAT acceleration.
So what performance numbers should we be getting with QAT if the
buffer sizes were large enough?
Specifically for AES128-XTS, under optimal conditions, the current
generation of QAT (GEN4) can achieve approximately 12 GB/s throughput at
4KB block sizes using a single device. Systems typically include between
1 and 4 QAT devices per socket and each device contains two internal
engines capable of performing that algorithm.
This level of performance is observed in userspace, where it is possible
to (1) batch requests to amortize MMIO overhead (e.g., multiple requests
per write), (2) submit requests asynchronously, (3) use flat buffers
instead of scatter-gather lists, and (4) rely on polling rather than
interrupts.
However, in the kernel, we are currently unable to keep the accelerator
sufficiently busy. For example, using a synthetic synchronous and single
threaded benchmark on a Sapphire Rapids system, with interrupts properly
affinitized, I observed throughput of around 500 Mbps with 4KB buffers.
Debugfs statistics (telemetry) indicated that the accelerator was
utilized at only ~4%.
Given this, VAES is currently the more suitable choice for kernel use
cases. The patch to lower the priority of QAT's symmetric crypto
algorithms reflects this practical reality. The original high priority
(4001) was set when the driver was first upstreamed in 2014 and had not
been revisited until now.
For some perspective, encrypting or decrypting 4 KiB messages with AES-128-XTS
serially, I get 18.4 GB/s per thread with the VAES-accelerated code on an Intel
Emerald Rapids processor.  (The code is arch/x86/crypto/aes-xts-avx-x86_64.S,
which I wrote and contributed in Linux 6.10.)  The processor appeared to be
running at about 3.28 GHz.  That's about 5.6 bytes per cycle.
Emerald Rapids processors have 6 to 60 cores per socket.  Even assuming that a
second thread in each core provides no benefit due to competing for the same
core's resources, that would be an AES-128-XTS throughput of 110 to 1100 GB/s.
That's way more than QAT could provide, even under optimal conditions, which
do not exist in practice because QAT is much harder to use than VAES.
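The per-cycle and per-socket figures above are simple arithmetic. A minimal
sketch in Python, using only numbers quoted in this thread. The function names
are illustrative, and both scaling estimates are assumptions: one thread per
core with perfectly linear scaling for VAES, and perfectly linear scaling
across 4 devices at 12 GB/s each for the QAT best case.

```python
def bytes_per_cycle(throughput_gb_s: float, clock_ghz: float) -> float:
    """GB/s divided by GHz gives bytes per cycle (the 1e9 factors cancel)."""
    return throughput_gb_s / clock_ghz


def aggregate_gb_s(per_unit_gb_s: float, units: int) -> float:
    """Assumes perfectly linear scaling across cores or devices."""
    return per_unit_gb_s * units


VAES_PER_THREAD_GB_S = 18.4   # Emerald Rapids, 4 KiB AES-128-XTS messages
EMR_CLOCK_GHZ = 3.28          # observed clock during the measurement
QAT_PER_DEVICE_GB_S = 12.0    # GEN4, optimal userspace conditions

print(f"VAES: {bytes_per_cycle(VAES_PER_THREAD_GB_S, EMR_CLOCK_GHZ):.1f} bytes/cycle")
for cores in (6, 60):
    total = aggregate_gb_s(VAES_PER_THREAD_GB_S, cores)
    print(f"VAES, {cores} cores: about {total:.0f} GB/s per socket")

# Hypothetical QAT best case: 4 devices per socket, linear scaling (assumption).
qat_best = aggregate_gb_s(QAT_PER_DEVICE_GB_S, 4)
print(f"QAT, 4 devices: about {qat_best:.0f} GB/s per socket")
```

Even under these generous assumptions for QAT, the smallest Emerald Rapids
part comes out ahead on paper.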
FWIW, on an AMD EPYC 9B45 (Zen 5 / Turin) server processor, I get 35.2 GB/s.
This processor appeared to run at about 4.15 GHz, so that's about 8.5 bytes per
cycle.  That's 51% more bytes per cycle than Intel.  This shows that there is
still room for improvement in VAES, even when it's already much better than QAT.
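The bytes-per-cycle comparison can be reproduced the same way. This is a
sketch using only the throughput and clock figures quoted in this message; the
helper name is illustrative, and the clock speeds are the approximate observed
values, so the result is only as precise as those.

```python
def bytes_per_cycle(throughput_gb_s: float, clock_ghz: float) -> float:
    """GB/s divided by GHz gives bytes per cycle (the 1e9 factors cancel)."""
    return throughput_gb_s / clock_ghz


intel_bpc = bytes_per_cycle(18.4, 3.28)   # Emerald Rapids
amd_bpc = bytes_per_cycle(35.2, 4.15)     # EPYC 9B45 (Zen 5 / Turin)

# Relative per-cycle advantage of AMD over Intel, in percent.
advantage_pct = (amd_bpc / intel_bpc - 1) * 100

print(f"Intel: {intel_bpc:.1f} bytes/cycle")
print(f"AMD:   {amd_bpc:.1f} bytes/cycle")
print(f"AMD advantage: {advantage_pct:.0f}%")
```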
It's unclear why Intel's efforts seem to be focused on QAT instead of VAES.
- Eric
