On Wed, 2025-04-23 at 10:30 -0700, Mina Almasry wrote:
On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu <cratiu@nvidia.com> wrote:
Drivers that are told to allocate RX buffers from pools of DMA memory should have enough memory in the pool to satisfy projected allocation requests (a function of ring size, MTU & other parameters). If there's not enough memory, RX ring refill might fail later at inconvenient times (e.g. during NAPI poll).
My understanding is that if the RX ring refill fails, the driver will post the buffers it was able to allocate data for and will not post the others. So it will run with degraded performance, but nothing overly bad should happen. This should be the same behavior as when the machine is under memory pressure.
What motivated this change was a failure in how mlx5 refills rings today. For efficiency, a ring refill triggered by NAPI polling only releases old buffers just before allocating new ones, so it effectively has a built-in assumption that the ring can always be fully refilled. Commit 4c2a13236807 ("net/mlx5e: RX, Defer page release in striding rq for better recycling") is an interesting read here.
For more details, see the do { } while loop in mlx5e_post_rx_mpwqes, where mlx5e_free_rx_mpwqe and then mlx5e_alloc_rx_mpwqe are called to free the old buffer and allocate a new one at the same position. This has excellent cache locality, and the pages returned to the pool are reused by the new descriptor.
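Roughly, the pattern is this (a heavily simplified sketch, not the actual kernel code; the argument lists and the index arithmetic are approximations, only the free-then-allocate ordering is the point):

/*
 * Simplified sketch of the refill loop -- not the real mlx5 code.
 */
static void refill_sketch(struct mlx5e_rq *rq, u16 head, u16 missing, u16 wq_sz)
{
	do {
		struct mlx5e_mpw_info *wi = mlx5e_get_mpw_info(rq, head);

		/* Release the pages of the old descriptor in this slot
		 * back to the page_pool...
		 */
		mlx5e_free_rx_mpwqe(rq, wi);

		/* ...and immediately build a new descriptor in the same
		 * slot, ideally backed by the pages just returned (hence
		 * the good cache locality and recycling).
		 */
		if (unlikely(mlx5e_alloc_rx_mpwqe(rq, head)))
			break;	/* pool empty: the slot stays unfilled */

		head = (head + 1) % wq_sz;	/* simplified index advance */
	} while (--missing);
}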
The bug in mlx5 is that with a large MTU & ring size, the ring cannot be fully populated with RX descriptors because the pool doesn't have enough memory, and because buffers are only released at refill time, no memory is ever returned to the pool for new ones. Eventually, RX descriptors are exhausted and traffic stops.
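To put made-up numbers on it: if each striding-RQ descriptor spans 8 pages and the ring has 1024 entries, filling the ring needs about 1024 * 8 * 4 KiB = 32 MiB of pool memory. A 16 MiB dmabuf can back at most ~512 descriptors, so the initial fill stops halfway, and since pages only come back to the pool when an old descriptor is freed during refill, the remaining slots can never be populated. (The 8 pages / 1024 entries / 16 MiB figures are purely illustrative, not the actual failing configuration.)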
In general I don't know about this change. If the user wants to use very small dmabufs, they should be able to, without jumping through hoops to reduce the number of RX ring slots the driver has (if it supports configuring that).
I think printing an error or warning that the dmabuf is too small for the pool_size may be fine. But outright failing this configuration? I don't think so.
For regular memory-backed page pools, there's no hard limit on how big they can become (other than available physical memory), so this issue was not seen before.
I haven't looked at other drivers yet, but is it expected that drivers operate with incompletely filled rings? I assumed that since the user configured a specific ring size, they expect drivers to be able to use that size and not silently operate in degraded mode with a smaller effective ring.
If you think drivers should work in degraded mode, we can look at improving the ring population code to work better in this scenario.
pool_size seems to be the number of ptr_ring slots in the page_pool, not some upper or lower bound on the amount of memory the page_pool can provide. So this check seems useless? The page_pool can still fail to provide this amount of memory with dmabuf (if the netmems aren't being recycled fast enough) or with normal memory (under memory pressure).
I think pool_size is usually set by drivers based on their params, and it's the max size of pool->ring. The opportunistic check I added compares this demand with the supply (the available chunk memory) and fails the config, based on the assumption that there should be enough memory in the pool to satisfy the driver's needs.
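Something along these lines, as a simplified illustration of the idea (not the exact hunk from the patch; the function name is made up, and treating pool_size * page size as the demand and the dmabuf size as the supply is exactly the assumption under discussion):

/*
 * Illustration only -- not the exact code from the patch. Demand is
 * taken as pool_size netmems of (PAGE_SIZE << order) bytes each;
 * supply is the total size of the bound dmabuf.
 */
static int dmabuf_pool_size_check(struct page_pool *pool,
				  struct net_devmem_dmabuf_binding *binding)
{
	u64 needed = (u64)pool->p.pool_size * (PAGE_SIZE << pool->p.order);

	if (needed > binding->dmabuf->size) {
		/* Could be only a warning instead, as you suggest above. */
		return -ENOMEM;
	}

	return 0;
}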
Please let me know your thoughts and how to proceed.
Cosmin.