On Thu, Apr 24, 2025 at 1:47 AM Cosmin Ratiu cratiu@nvidia.com wrote:
On Wed, 2025-04-23 at 10:30 -0700, Mina Almasry wrote:
On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu cratiu@nvidia.com wrote:
Drivers that are told to allocate RX buffers from pools of DMA memory should have enough memory in the pool to satisfy projected allocation requests (a function of ring size, MTU & other parameters). If there's not enough memory, RX ring refill might fail later at inconvenient times (e.g. during NAPI poll).
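As a rough lower bound (illustrative arithmetic, not taken from a specific setup): a ring of N descriptors that each hold one MTU-sized buffer needs at least N * ceil(MTU / PAGE_SIZE) pages from the pool, so e.g. 1024 descriptors at MTU 9000 already need 1024 * 3 = 3072 4K pages, i.e. 12 MB of pool memory, before counting buffers still held by the stack.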
My understanding is that if the RX ring refill fails, the driver will post the buffers it was able to allocate data for, and will not post other buffers. So it will run with degraded performance, but nothing overly bad should happen. This should be the same behavior as when the machine is under memory pressure.
What motivated this change was a failure in how mlx5 refills rings today. For efficiency, ring refill triggered by NAPI polling only releases old buffers just before allocating new ones, so it effectively has a built-in assumption that the ring can be filled. Commit 4c2a13236807 ("net/mlx5e: RX, Defer page release in striding rq for better recycling") is an interesting read here.
For more details, see the do{ }while loop in mlx5e_post_rx_mpwqes, where mlx5e_free_rx_mpwqe then mlx5e_alloc_rx_mpwqe are called to free the old buffer and reallocate a new one at the same position. This has excellent cache-locality and the pages returned to the pool will be reused by the new descriptor.
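The shape of that loop, heavily paraphrased (this is a sketch of the structure described above, not the verbatim upstream code; apart from mlx5e_free_rx_mpwqe/mlx5e_alloc_rx_mpwqe the names are stand-ins):

/* Paraphrased sketch of the striding-RQ refill pattern; not the real code. */
do {
	wi = mpw_info(rq, head);	/* stand-in for the per-slot info lookup */

	/* Release the old pages of this slot back to the page_pool... */
	mlx5e_free_rx_mpwqe(rq, wi);

	/* ...and immediately allocate new pages for the same slot, so the
	 * just-released pages can be recycled with good cache locality.
	 */
	if (unlikely(mlx5e_alloc_rx_mpwqe(rq, head)))
		break;			/* pool short: ring left partially filled */

	head = next_slot(rq, head);	/* stand-in for advancing the ring index */
} while (--missing);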
Thanks for the detailed explanation. This seems like a clever optimization.
The bug in mlx5 is that with a large MTU & ring size, the ring cannot be fully populated with rx descriptors because the pool doesn't have enough memory, but there's no memory released back to the pool for new ones. Eventually, rx descriptors are exhausted and traffic stops.
In general I don't know about this change. If the user wants to use very small dmabufs, they should be able to, without jumping through hoops to reduce the number of rx ring slots the driver has (if it supports configuring that).
I think printing an error or warning that the dmabuf is too small for the pool_size may be fine. But outright failing this configuration? I don't think so.
For regular memory-backed page pools, there's no hard limit on how big they can become (except available physical memory), so this issue was not seen before.
I didn't look at other drivers yet, but is it expected that drivers operate with incompletely filled rings? I assumed that since the user configured a specific ring size, they expect the driver to be able to use that size and not silently operate in degraded mode with a smaller ring.
If you think drivers should work in degraded mode, we can look at improving the ring population code to work better in this scenario.
You're probably a bigger expert than me on what drivers should do in general, but yes, this seems like an mlx5 limitation, not a general limitation of all drivers. GVE, for example, I think has a host of optimizations for memory pressure scenarios that make it resilient to this. I ran this test:
mina-3 /home/almasrymina_google_com # ethtool -g eth0
Ring parameters for eth0:
...
Current hardware settings:
...
RX:             1024
Then I hacked ncdevmem to only provide a 64-page udmabuf:
diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 1f9fb0b1cb811..6de64f7680241 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -76,7 +76,7 @@
 #define PAGE_SHIFT 12
 #define TEST_PREFIX "ncdevmem"
-#define NUM_PAGES 16000
+#define NUM_PAGES 64

 #ifndef MSG_SOCK_DEVMEM
 #define MSG_SOCK_DEVMEM 0x2000000
To my surprise, the test passed just fine. It seems the limitation doesn't apply to GVE, at least. I don't know what the rest of the drivers do, but so far this seems like driver-specific behavior. Putting limitations in the core stack for mlx5 issues doesn't seem great.
pool_size seems to be the number of ptr_ring slots in the page_pool, not some upper or lower bound on the amount of memory the page_pool can provide. So this check seems useless? The page_pool can still not provide this amount of memory with dmabuf (if the netmems aren't being recycled fast enough) or with normal memory (under memory pressure).
I think pool_size is usually set by drivers based on their params, and it's the max size of pool->ring. The opportunistic check I added compares this demand with the supply (available chunk memory) and fails this config based on the assumption that there should be enough memory in the pool to satisfy driver needs.
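Schematically, it is a check of this shape (a simplified sketch, with a made-up function name and an assumed way of obtaining the dmabuf size; not the literal patch):

/* Illustrative sketch only: refuse the binding when the dma-buf cannot
 * cover pool_size buffers of the pool's configured order.
 */
static int devmem_pool_fits(struct page_pool *pool, size_t dmabuf_size)
{
	/* Demand: pool_size buffers of the pool's configured order. */
	size_t demand = (size_t)pool->p.pool_size << (PAGE_SHIFT + pool->p.order);

	if (dmabuf_size < demand)
		return -ENOMEM;	/* not enough chunk memory to fill the ring */
	return 0;
}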
Please let me know your thoughts and how to proceed.
I think there are better options here:
1. In the page_pool, warn if the dmabuf is too small for the ring size, but don't outright prevent the configuration (a sketch follows this list). If the user is running on a driver that doesn't have a dmabuf size limitation, let them ignore the warning and run it.
2. In mlx5 code, somehow find out how big the dmabuf is (it may need a new pp api), and then prevent this configuration in the driver to work around your (I think) driver-specific limitation.
3. Maybe address the general limitation in mlx5, and make it work even if it can't refill its rings? It would help this case and other memory pressure scenarios as well.
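For option 1, the same comparison sketched earlier in the thread could just warn instead of failing, roughly (illustrative names only):

	if (dmabuf_size < demand)
		pr_warn("page_pool: dmabuf (%zu bytes) is smaller than the pool's demand (%zu bytes), expect allocation failures\n",
			dmabuf_size, demand);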