On Mon, Jun 16, 2025 at 04:02:30PM +0100, Giovanni Cabiddu wrote:
> On Mon, Jun 16, 2025 at 12:18:02PM +0800, Herbert Xu wrote:
> > On Fri, Jun 13, 2025 at 11:32:27AM +0100, Giovanni Cabiddu wrote:
> > > Most kernel applications utilizing the crypto API operate synchronously and on small buffer sizes, and therefore do not benefit from QAT acceleration.
> >
> > So what performance numbers should we be getting with QAT if the buffer sizes were large enough?
>
> Specifically for AES-128-XTS, under optimal conditions, the current generation of QAT (GEN4) can achieve approximately 12 GB/s throughput at 4KB block sizes using a single device. Systems typically include between 1 and 4 QAT devices per socket, and each device contains two internal engines capable of performing that algorithm.
>
> This level of performance is observed in userspace, where it is possible to (1) batch requests to amortize MMIO overhead (e.g., multiple requests per write), (2) submit requests asynchronously, (3) use flat buffers instead of scatter-gather lists, and (4) rely on polling rather than interrupts.
>
> However, in the kernel, we are currently unable to keep the accelerator sufficiently busy. For example, using a synthetic synchronous and single-threaded benchmark on a Sapphire Rapids system, with interrupts properly affinitized, I observed throughput of around 500 Mbps with 4KB buffers. Debugfs statistics (telemetry) indicated that the accelerator was utilized at only ~4%.
>
> Given this, VAES is currently the more suitable choice for kernel use cases. The patch to lower the priority of QAT's symmetric crypto algorithms reflects this practical reality. The original high priority (4001) was set when the driver was first upstreamed in 2014 and had not been revisited until now.

For some perspective, encrypting or decrypting 4 KiB messages with AES-128-XTS serially, I get 18.4 GB/s per thread with the VAES-accelerated code on an Intel Emerald Rapids processor. (The code is arch/x86/crypto/aes-xts-avx-x86_64.S, which I wrote and contributed in Linux 6.10.) The processor appeared to be running at about 3.28 GHz. That's about 5.6 bytes per cycle.
Emerald Rapids processors have 6 to 60 cores per socket. Even assuming that a second thread in each core provides no benefit due to competing for the same core's resources, that would be an AES-128-XTS throughput of 110 to 1100 GB/s.
That's far more than QAT could provide, even under optimal conditions, which do not exist in practice because QAT is much harder to use than VAES.
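As a rough check on the arithmetic above, here is a minimal sketch that recomputes the per-cycle and per-socket figures from the numbers quoted in this thread. It assumes throughput scales linearly with core count and with QAT device count, which is an idealization; the constant and function names are illustrative only.

```python
# Figures quoted in this thread. Linear scaling across cores and across
# QAT devices is an assumption made only for this back-of-envelope check.
VAES_GBPS_PER_THREAD = 18.4      # AES-128-XTS, 4 KiB messages, one thread
CPU_GHZ = 3.28                   # observed clock on Emerald Rapids
CORES_PER_SOCKET = (6, 60)       # low and high end quoted above
QAT_GBPS_PER_DEVICE = 12.0       # GEN4, optimal userspace conditions
QAT_DEVICES_PER_SOCKET = (1, 4)  # typical range quoted above


def bytes_per_cycle(gbps, ghz):
    """GB/s divided by Gcycles/s gives bytes per cycle."""
    return gbps / ghz


def socket_range(per_unit_gbps, unit_counts):
    """Scale a per-unit throughput to each unit count (assumes linearity)."""
    return tuple(per_unit_gbps * n for n in unit_counts)


vaes_bpc = bytes_per_cycle(VAES_GBPS_PER_THREAD, CPU_GHZ)
vaes_socket = socket_range(VAES_GBPS_PER_THREAD, CORES_PER_SOCKET)
qat_socket = socket_range(QAT_GBPS_PER_DEVICE, QAT_DEVICES_PER_SOCKET)

print(f"VAES: {vaes_bpc:.1f} bytes/cycle per thread")
print(f"VAES per socket: {vaes_socket[0]:.0f} to {vaes_socket[1]:.0f} GB/s")
print(f"QAT per socket (ideal): {qat_socket[0]:.0f} to {qat_socket[1]:.0f} GB/s")
```

Even with every assumption tilted in QAT's favor, the high end of the QAT range stays below the low end of the VAES range.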
FWIW, on an AMD EPYC 9B45 (Zen 5 / Turin) server processor, I get 35.2 GB/s. This processor appeared to run at about 4.15 GHz, so that's about 8.5 bytes per cycle. That's 51% more bytes per cycle than Intel. This shows that there is still room for improvement in VAES, even when it's already much better than QAT.
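The per-cycle comparison can be reproduced with the short sketch below. It uses only the throughput and clock figures quoted above; the clock speeds are the approximate observed values, so the result is approximate as well, and the variable names are illustrative.

```python
# Per-thread AES-128-XTS throughput (GB/s) and approximate observed
# clock (GHz), as quoted in this message.
INTEL_GBPS, INTEL_GHZ = 18.4, 3.28   # Emerald Rapids
AMD_GBPS, AMD_GHZ = 35.2, 4.15       # EPYC 9B45 (Zen 5 / Turin)

# GB/s divided by GHz (Gcycles/s) gives bytes per cycle.
intel_bpc = INTEL_GBPS / INTEL_GHZ
amd_bpc = AMD_GBPS / AMD_GHZ

# Relative advantage of AMD over Intel in bytes per cycle, as a percentage.
advantage_pct = (amd_bpc / intel_bpc - 1.0) * 100.0

print(f"Intel: {intel_bpc:.1f} bytes/cycle")
print(f"AMD:   {amd_bpc:.1f} bytes/cycle")
print(f"AMD advantage: {advantage_pct:.0f}%")
```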
It's unclear why Intel's efforts seem to be focused on QAT instead of VAES.
- Eric