On Tue, Jun 17, 2025 at 12:57:05PM +0800, Herbert Xu wrote:
> On Mon, Jun 16, 2025 at 04:02:30PM +0100, Giovanni Cabiddu wrote:
> > This level of performance is observed in userspace, where it is possible to (1) batch requests to amortize MMIO overhead (e.g., multiple requests per write), (2) submit requests asynchronously, (3) use flat buffers instead of scatter-gather lists, and (4) rely on polling rather than interrupts.
> So is batching a large number of 4K requests sufficient to achieve the maximum throughput? Or does it require physically contiguous memory much greater than 4K in size?
Yes, batching a large number of 4KiB requests is sufficient to achieve near-maximum throughput.
In an experiment using the skcipher APIs in asynchronous mode, I was able to reach approximately 11 GB/s throughput with 4KiB buffers. To achieve this, I had to increase the request queue depth and adjust the interrupt coalescing timer, which is set quite high by default.
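For reference, below is a minimal sketch of the submission pattern I mean, not the actual test code: keep a fixed number of requests in flight, submit them all without waiting on each one individually, and count completions in the callback. NR_INFLIGHT, BUF_SIZE and the assumption that the tfm is already allocated and keyed are illustrative, and it assumes the newer (>= 6.3) completion callback signature that receives the data pointer directly.

#include <crypto/skcipher.h>
#include <linux/scatterlist.h>
#include <linux/completion.h>
#include <linux/atomic.h>
#include <linux/slab.h>

#define NR_INFLIGHT	128		/* requests kept in flight (assumed) */
#define BUF_SIZE	4096		/* 4 KiB buffers, as in the experiment */

struct batch {
	atomic_t pending;
	struct completion all_done;
};

static void batch_req_done(void *data, int err)
{
	struct batch *b = data;

	/* Backlogged requests signal -EINPROGRESS first; ignore that. */
	if (err == -EINPROGRESS)
		return;
	if (atomic_dec_and_test(&b->pending))
		complete(&b->all_done);
}

/* tfm is assumed to be allocated with crypto_alloc_skcipher() and keyed. */
static int submit_batch(struct crypto_skcipher *tfm)
{
	struct skcipher_request **reqs;
	struct scatterlist *sgs;
	void **bufs;
	u8 iv[16] = { 0 };		/* AES block-size IV, throughput test only */
	struct batch b;
	int i, ret;

	reqs = kcalloc(NR_INFLIGHT, sizeof(*reqs), GFP_KERNEL);
	sgs = kcalloc(NR_INFLIGHT, sizeof(*sgs), GFP_KERNEL);
	bufs = kcalloc(NR_INFLIGHT, sizeof(*bufs), GFP_KERNEL);
	if (!reqs || !sgs || !bufs)
		return -ENOMEM;		/* sketch: cleanup omitted */

	atomic_set(&b.pending, NR_INFLIGHT);
	init_completion(&b.all_done);

	for (i = 0; i < NR_INFLIGHT; i++) {
		bufs[i] = kmalloc(BUF_SIZE, GFP_KERNEL);
		reqs[i] = skcipher_request_alloc(tfm, GFP_KERNEL);
		if (!bufs[i] || !reqs[i])
			return -ENOMEM;	/* sketch: cleanup omitted */

		sg_init_one(&sgs[i], bufs[i], BUF_SIZE);
		skcipher_request_set_callback(reqs[i],
					      CRYPTO_TFM_REQ_MAY_BACKLOG,
					      batch_req_done, &b);
		skcipher_request_set_crypt(reqs[i], &sgs[i], &sgs[i],
					   BUF_SIZE, iv);
	}

	/* Fire everything; completions are counted in the callback. */
	for (i = 0; i < NR_INFLIGHT; i++) {
		ret = crypto_skcipher_encrypt(reqs[i]);
		if (ret != -EINPROGRESS && ret != -EBUSY)
			batch_req_done(&b, ret);	/* completed inline */
	}

	wait_for_completion(&b.all_done);

	for (i = 0; i < NR_INFLIGHT; i++) {
		skcipher_request_free(reqs[i]);
		kfree(bufs[i]);
	}
	kfree(reqs);
	kfree(sgs);
	kfree(bufs);
	return 0;
}

With a deep enough ring and a lower coalescing timer, a loop like this is what keeps the device busy enough to approach the figure above.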
I'm continuing to experiment. For example, I modified the code to send a direct pointer to the device when the source and destination scatterlists each contain only a single entry. This should reduce I/O overhead by removing the need for the device to fetch the scatter-gather list descriptors.
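To make that concrete, here is an illustrative sketch; the descriptor fields (src_addr, dst_addr, len) are hypothetical, not the real QAT firmware layout, and the buffers are assumed to be DMA-mapped already by the caller.

#include <linux/scatterlist.h>
#include <linux/types.h>

struct fw_desc {			/* hypothetical descriptor layout */
	dma_addr_t src_addr;
	dma_addr_t dst_addr;
	u32 len;
};

static bool fill_flat_desc(struct fw_desc *desc, struct scatterlist *src,
			   struct scatterlist *dst, unsigned int len)
{
	/* Fast path: one mapped segment per side, point at it directly. */
	if (sg_nents(src) != 1 || sg_nents(dst) != 1)
		return false;		/* caller falls back to the SGL path */

	desc->src_addr = sg_dma_address(src);
	desc->dst_addr = sg_dma_address(dst);
	desc->len = len;
	return true;
}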
Regarding the synchronous use case, preliminary analysis shows that the main bottlenecks are: (1) interrupt handling, in particular the overhead of completion handling, with significant time spent in the tasklet that executes crypto_req_done(), and (2) the latency of waiting on the device. I'm exploring ways to improve both.
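For context, this is the standard synchronous wait pattern whose completion path shows up in the profile (generic API usage, not QAT-specific code): the device interrupt schedules the driver's response tasklet, the tasklet invokes the request callback, i.e. crypto_req_done(), and that wakes the sleeping caller.

#include <crypto/skcipher.h>
#include <linux/crypto.h>

static int encrypt_sync(struct skcipher_request *req)
{
	DECLARE_CRYPTO_WAIT(wait);

	/* crypto_req_done() runs from the driver's completion tasklet. */
	skcipher_request_set_callback(req,
				      CRYPTO_TFM_REQ_MAY_BACKLOG |
				      CRYPTO_TFM_REQ_MAY_SLEEP,
				      crypto_req_done, &wait);

	/* Sleeps until the interrupt/tasklet path signals the completion. */
	return crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
}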
While this work might seem moot given that AES is faster in the core, the same optimizations are applicable to the compression service, where QAT can still provide benefits.
Regards,