On Thu, Apr 24, 2025 at 1:47 AM Cosmin Ratiu cratiu@nvidia.com wrote:
On Wed, 2025-04-23 at 10:30 -0700, Mina Almasry wrote:
On Wed, Apr 23, 2025 at 9:03 AM Cosmin Ratiu cratiu@nvidia.com wrote:
Drivers that are told to allocate RX buffers from pools of DMA memory should have enough memory in the pool to satisfy projected allocation requests (a function of ring size, MTU & other parameters). If there's not enough memory, RX ring refill might fail later at inconvenient times (e.g. during NAPI poll).
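As a rough lower bound (illustrative arithmetic, not taken from a specific setup): a ring of N descriptors that each hold one MTU-sized buffer needs at least N * ceil(MTU / PAGE_SIZE) pages from the pool, so e.g. 1024 descriptors at MTU 9000 already need 1024 * 3 = 3072 4K pages, i.e. 12 MB of pool memory, before counting buffers still held by the stack.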
My understanding is that if the RX ring refill fails, the driver will post the buffers it was able to allocate data for, and will not post other buffers. So it will run with degraded performance, but nothing overly bad should happen. This should be the same behavior as when the machine is under memory pressure.
What motivated this change was a failure in how mlx5 refills rings today. For efficiency, ring refill triggered by NAPI polling only releases old buffers just before allocating new ones, so it effectively has a built-in assumption that the ring can be filled. Commit 4c2a13236807 ("net/mlx5e: RX, Defer page release in striding rq for better recycling") is an interesting read here.
For more details, see the do{ }while loop in mlx5e_post_rx_mpwqes, where mlx5e_free_rx_mpwqe then mlx5e_alloc_rx_mpwqe are called to free the old buffer and reallocate a new one at the same position. This has excellent cache-locality and the pages returned to the pool will be reused by the new descriptor.
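The shape of that loop, heavily paraphrased (this is a sketch of the structure described above, not the verbatim upstream code; apart from mlx5e_free_rx_mpwqe/mlx5e_alloc_rx_mpwqe the names are stand-ins):

/* Paraphrased sketch of the striding-RQ refill pattern; not the real code. */
do {
	wi = mpw_info(rq, head);	/* stand-in for the per-slot info lookup */

	/* Release the old pages of this slot back to the page_pool... */
	mlx5e_free_rx_mpwqe(rq, wi);

	/* ...and immediately allocate new pages for the same slot, so the
	 * just-released pages can be recycled with good cache locality.
	 */
	if (unlikely(mlx5e_alloc_rx_mpwqe(rq, head)))
		break;			/* pool short: ring left partially filled */

	head = next_slot(rq, head);	/* stand-in for advancing the ring index */
} while (--missing);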
Thanks for the detailed explanation. This seems like a clever optimization.
The bug in mlx5 is that with a large MTU & ring size, the ring cannot be fully populated with rx descriptors because the pool doesn't have enough memory, but there's no memory released back to the pool for new ones. Eventually, rx descriptors are exhausted and traffic stops.
In general I don't know about this change. If the user wants to use very small dmabufs, they should be able to, without jumping through hoops to reduce the number of rx ring slots the driver has (if it supports configuring that).
I think printing an error or warning that the dmabuf is too small for the pool_size may be fine. But outright failing this configuration? I don't think so.
For regular memory-backed page pools, there's no hard limit on how big they can become (except available physical memory), so this issue was not seen before.
I didn't look at other drivers yet, but is it expected that drivers operate with incompletely filled rings? I assumed that since the user configured a specific ring size, they expect the driver to be able to use that size and not silently operate in degraded mode with a smaller ring.
If you think drivers should work in degraded mode, we can look at improving the ring population code to work better in this scenario.
You're probably a bigger expert than me on what drivers should do in general, but yes, this seems like an mlx5 limitation, not a general limitation of all drivers. GVE, for example, I think has a host of optimizations for memory pressure scenarios that make it resilient to this. I ran this test:
mina-3 /home/almasrymina_google_com # ethtool -g eth0
Ring parameters for eth0:
...
Current hardware settings:
...
RX:             1024
Then I hacked ncdevmem to only provide a 64-page udmabuf:
diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 1f9fb0b1cb811..6de64f7680241 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -76,7 +76,7 @@
 #define PAGE_SHIFT 12
 #define TEST_PREFIX "ncdevmem"
-#define NUM_PAGES 16000
+#define NUM_PAGES 64

 #ifndef MSG_SOCK_DEVMEM
 #define MSG_SOCK_DEVMEM 0x2000000
To my surprise, the test passed just fine. It seems the limitation doesn't apply to GVE, at least. I don't know what the rest of the drivers do, but so far this seems like driver-specific behavior. Putting limitations in the core stack for mlx5 issues doesn't seem great.
pool_size seems to be the number of ptr_ring slots in the page_pool, not some upper or lower bound on the amount of memory the page_pool can provide. So this check seems useless? The page_pool can still not provide this amount of memory with dmabuf (if the netmems aren't being recycled fast enough) or with normal memory (under memory pressure).
I think pool_size is usually set by drivers based on their params, and it's the max size of pool->ring. The opportunistic check I added compares this demand with the supply (available chunk memory) and fails this config based on the assumption that there should be enough memory in the pool to satisfy driver needs.
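Schematically, it is a check of this shape (a simplified sketch, with a made-up function name and an assumed way of obtaining the dmabuf size; not the literal patch):

/* Illustrative sketch only: refuse the binding when the dma-buf cannot
 * cover pool_size buffers of the pool's configured order.
 */
static int devmem_pool_fits(struct page_pool *pool, size_t dmabuf_size)
{
	/* Demand: pool_size buffers of the pool's configured order. */
	size_t demand = (size_t)pool->p.pool_size << (PAGE_SHIFT + pool->p.order);

	if (dmabuf_size < demand)
		return -ENOMEM;	/* not enough chunk memory to fill the ring */
	return 0;
}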
Please let me know your thoughts and how to proceed.
I think there are better options here:
1. In the page_pool, warn if the dmabuf is too small for the ring size, but don't outright prevent the configuration (a sketch follows this list). If the user is running on a driver that doesn't have a dmabuf size limitation, let them ignore the warning and run it.
2. In mlx5 code, somehow find out how big the dmabuf is (it may need a new pp api), and then prevent this configuration in the driver to work around your (I think) driver-specific limitation.
3. Maybe address the general limitation in mlx5, and make it work even if it can't refill its rings? It would help this case and other memory pressure scenarios as well.
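For option 1, the same comparison sketched earlier in the thread could just warn instead of failing, roughly (illustrative names only):

	if (dmabuf_size < demand)
		pr_warn("page_pool: dmabuf (%zu bytes) is smaller than the pool's demand (%zu bytes), expect allocation failures\n",
			dmabuf_size, demand);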