On Tue, Jun 17, 2025 at 12:57:05PM +0800, Herbert Xu wrote:
> On Mon, Jun 16, 2025 at 04:02:30PM +0100, Giovanni Cabiddu wrote:
> > This level of performance is observed in userspace, where it is possible to (1) batch requests to amortize MMIO overhead (e.g., multiple requests per write), (2) submit requests asynchronously, (3) use flat buffers instead of scatter-gather lists, and (4) rely on polling rather than interrupts.
> So is batching a large number of 4K requests sufficient to achieve the maximum throughput? Or does it require physically contiguous memory much greater than 4K in size?
Yes, batching a large number of 4KiB requests is sufficient to achieve near-maximum throughput.
In an experiment using the skcipher APIs in asynchronous mode, I was able to reach approximately 11 GB/s throughput with 4KiB buffers. To achieve this, I had to increase the request queue depth and adjust the interrupt coalescing timer, which is set quite high by default.
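For reference, below is a minimal sketch of the submission pattern I mean, not the actual test code: keep a fixed number of requests in flight, submit them all without waiting on each one individually, and count completions in the callback. NR_INFLIGHT, BUF_SIZE and the assumption that the tfm is already allocated and keyed are illustrative, and it assumes the newer (>= 6.3) completion callback signature that receives the data pointer directly.

#include <crypto/skcipher.h>
#include <linux/scatterlist.h>
#include <linux/completion.h>
#include <linux/atomic.h>
#include <linux/slab.h>

#define NR_INFLIGHT	128		/* requests kept in flight (assumed) */
#define BUF_SIZE	4096		/* 4 KiB buffers, as in the experiment */

struct batch {
	atomic_t pending;
	struct completion all_done;
};

static void batch_req_done(void *data, int err)
{
	struct batch *b = data;

	/* Backlogged requests signal -EINPROGRESS first; ignore that. */
	if (err == -EINPROGRESS)
		return;
	if (atomic_dec_and_test(&b->pending))
		complete(&b->all_done);
}

/* tfm is assumed to be allocated with crypto_alloc_skcipher() and keyed. */
static int submit_batch(struct crypto_skcipher *tfm)
{
	struct skcipher_request **reqs;
	struct scatterlist *sgs;
	void **bufs;
	u8 iv[16] = { 0 };		/* AES block-size IV, throughput test only */
	struct batch b;
	int i, ret;

	reqs = kcalloc(NR_INFLIGHT, sizeof(*reqs), GFP_KERNEL);
	sgs = kcalloc(NR_INFLIGHT, sizeof(*sgs), GFP_KERNEL);
	bufs = kcalloc(NR_INFLIGHT, sizeof(*bufs), GFP_KERNEL);
	if (!reqs || !sgs || !bufs)
		return -ENOMEM;		/* sketch: cleanup omitted */

	atomic_set(&b.pending, NR_INFLIGHT);
	init_completion(&b.all_done);

	for (i = 0; i < NR_INFLIGHT; i++) {
		bufs[i] = kmalloc(BUF_SIZE, GFP_KERNEL);
		reqs[i] = skcipher_request_alloc(tfm, GFP_KERNEL);
		if (!bufs[i] || !reqs[i])
			return -ENOMEM;	/* sketch: cleanup omitted */

		sg_init_one(&sgs[i], bufs[i], BUF_SIZE);
		skcipher_request_set_callback(reqs[i],
					      CRYPTO_TFM_REQ_MAY_BACKLOG,
					      batch_req_done, &b);
		skcipher_request_set_crypt(reqs[i], &sgs[i], &sgs[i],
					   BUF_SIZE, iv);
	}

	/* Fire everything; completions are counted in the callback. */
	for (i = 0; i < NR_INFLIGHT; i++) {
		ret = crypto_skcipher_encrypt(reqs[i]);
		if (ret != -EINPROGRESS && ret != -EBUSY)
			batch_req_done(&b, ret);	/* completed inline */
	}

	wait_for_completion(&b.all_done);

	for (i = 0; i < NR_INFLIGHT; i++) {
		skcipher_request_free(reqs[i]);
		kfree(bufs[i]);
	}
	kfree(reqs);
	kfree(sgs);
	kfree(bufs);
	return 0;
}

With a deep enough ring and a lower coalescing timer, a loop like this is what keeps the device busy enough to approach the figure above.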
I'm continuing to experiment. For example, I modified the code to send a direct pointer to the device when the source and destination scatterlists each contain only a single entry. This should reduce I/O overhead by removing the need for the device to fetch the scatter-gather list descriptors.
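To make that concrete, here is an illustrative sketch; the descriptor fields (src_addr, dst_addr, len) are hypothetical, not the real QAT firmware layout, and the buffers are assumed to be DMA-mapped already by the caller.

#include <linux/scatterlist.h>
#include <linux/types.h>

struct fw_desc {			/* hypothetical descriptor layout */
	dma_addr_t src_addr;
	dma_addr_t dst_addr;
	u32 len;
};

static bool fill_flat_desc(struct fw_desc *desc, struct scatterlist *src,
			   struct scatterlist *dst, unsigned int len)
{
	/* Fast path: one mapped segment per side, point at it directly. */
	if (sg_nents(src) != 1 || sg_nents(dst) != 1)
		return false;		/* caller falls back to the SGL path */

	desc->src_addr = sg_dma_address(src);
	desc->dst_addr = sg_dma_address(dst);
	desc->len = len;
	return true;
}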
Regarding the synchronous use case, preliminary analysis shows that the main bottlenecks are: (1) interrupt handling, in particular the overhead of completion handling, with significant time spent in the tasklet that executes crypto_req_done(), and (2) the latency of waiting on the device. I'm exploring ways to improve both.
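For context, this is the standard synchronous wait pattern whose completion path shows up in the profile (generic API usage, not QAT-specific code): the device interrupt schedules the driver's response tasklet, the tasklet invokes the request callback, i.e. crypto_req_done(), and that wakes the sleeping caller.

#include <crypto/skcipher.h>
#include <linux/crypto.h>

static int encrypt_sync(struct skcipher_request *req)
{
	DECLARE_CRYPTO_WAIT(wait);

	/* crypto_req_done() runs from the driver's completion tasklet. */
	skcipher_request_set_callback(req,
				      CRYPTO_TFM_REQ_MAY_BACKLOG |
				      CRYPTO_TFM_REQ_MAY_SLEEP,
				      crypto_req_done, &wait);

	/* Sleeps until the interrupt/tasklet path signals the completion. */
	return crypto_wait_req(crypto_skcipher_encrypt(req), &wait);
}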
While this work might seem moot given that AES is faster in the core, the same optimizations are applicable to the compression service, where QAT can still provide benefits.
Regards,