On Thu, Nov 9, 2023 at 11:38 PM Yunsheng Lin linyunsheng@huawei.com wrote:
On 2023/11/10 10:59, Mina Almasry wrote:
On Thu, Nov 9, 2023 at 12:30 AM Paolo Abeni pabeni@redhat.com wrote:
I'm trying to wrap my head around the whole infra... the above line is confusing. Why do you increment dma_addr? it will be re-initialized in the next iteration.
That is just a mistake, sorry. Will remove this increment.
You seems to be combining comments in different thread and replying in one thread, I am not sure that is a good practice and I almost missed the reply below as I don't seem to be cc'ed.
Sorry about that.
On Thu, Nov 9, 2023 at 1:29 AM Yunsheng Lin linyunsheng@huawei.com wrote:> >>>
gen_pool_destroy BUG_ON() if it's not empty at the time of destroying. Technically that should never happen, because __netdev_devmem_binding_free() should only be called when the refcount hits 0, so all the chunks have been freed back to the gen_pool. But, just in case, I don't want to crash the server just because I'm leaking a chunk... this is a bit of defensive programming that is typically frowned upon, but the behavior of gen_pool is so severe I think the WARN() + check is warranted here.
It seems it is pretty normal for the above to happen nowadays because of retransmits timeouts, NAPI defer schemes mentioned below:
https://lkml.kernel.org/netdev/168269854650.2191653.8465259808498269815.stgi...
And currently page pool core handles that by using a workqueue.
Forgive me but I'm not understanding the concern here.
__netdev_devmem_binding_free() is called when binding->ref hits 0.
binding->ref is incremented when an iov slice of the dma-buf is allocated, and decremented when an iov is freed. So, __netdev_devmem_binding_free() can't really be called unless all the iovs have been freed, and gen_pool_size() == gen_pool_avail(), regardless of what's happening on the page_pool side of things, right?
I seems to misunderstand it. In that case, it seems to be about defensive programming like other checking.
By looking at it more closely, it seems napi_frag_unref() call page_pool_page_put_many() directly, which means devmem seems to be bypassing the napi_safe optimization.
Can napi_frag_unref() reuse napi_pp_put_page() in order to reuse the napi_safe optimization?
I think it already does. page_pool_page_put_many() is only called if !recycle or !napi_pp_put_page(). In that case page_pool_page_put_many() is just a replacement for put_page(), because this 'page' may be an iov.
Is there a reason why not calling napi_pp_put_page() for devmem too instead of calling page_pool_page_put_many()? mem provider has a 'release_page' ops, calling page_pool_page_put_many() directly here seems to be bypassing the 'release_page' ops, which means devmem is bypassing most of the main features of page pool.
I think we're still calling napi_pp_put_page() as normal:
/** @@ -3441,13 +3466,13 @@ bool napi_pp_put_page(struct page *page, bool napi_safe); static inline void napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe) { - struct page *page = skb_frag_page(frag); - #ifdef CONFIG_PAGE_POOL - if (recycle && napi_pp_put_page(page, napi_safe)) + if (recycle && napi_pp_put_page(frag->bv_page, napi_safe)) return; + page_pool_page_put_many(frag->bv_page, 1); +#else + put_page(skb_frag_page(frag)); #endif - put_page(page); }
The code change here is to replace put_page() with page_pool_page_put_many(), only, because bv_page may be a page_pool_iov, so we need to use page_pool_page_put_many() which handles page_pool_iov correcly. I did not change whether or not napi_pp_put_page() is called. It's still called if recycle==true.
As far as I can tell, the main features of page pool:
- Allow lockless allocation and freeing in pool->alloc cache by utilizing NAPI non-concurrent context.
- Allow concurrent allocation and freeing in pool->ring cache by utilizing ptr_ring.
- Allow dma map/unmap and cache sync optimization.
- Allow detailed stats logging and tracing.
- Allow some bulk allocation and freeing.
- support both skb packet and xdp frame.
I am wondering what is the main features that devmem is utilizing by intergrating into page pool?
It seems the driver can just call netdev_alloc_devmem() and napi_frag_unref() can call netdev_free_devmem() directly without intergrating into page pool and it should just works too?
Maybe we should consider creating a new thin layer, in order to demux to page pool, devmem or other mem type if my suggestion does not work out too?
I went through this discussion with Jesper on RFC v2 in this thread:
https://lore.kernel.org/netdev/CAHS8izOVJGJH5WF68OsRWFKJid1_huzzUK+hpKbLcL4p...
which culminates with that email where he seems on board with the change from a performance POV, and seems on board with hiding the memory type implementation from the drivers. That thread fully goes over the tradeoffs of integrating over the page pool or creating new ones. Integrating with the page pool abstracts most of the devmem implementation (and other memory types) from the driver. It reuses page pool features like page recycling for example.