On Tue, 7 Nov 2023 14:23:20 -0800 Stanislav Fomichev wrote:
> Can we mark a socket as devmem-only? Do we have any use-case for those
> hybrid setups? Or, let me put it that way: do we expect API callers
> to handle both linear and non-linear cases correctly?
> As a consumer of the previous versions of these apis internally,
> I find all those corner cases confusing :-( Hence trying to understand
> whether we can make it a bit more rigid and properly defined upstream.
FWIW I'd also prefer to allow mixing. "Some NICs" can decide header-data
split (HDS) very flexibly, including landing full jumbo frames in the
"headers".
There's no sender API today to signal how to mark the data for
selective landing, but if Mina already has the rx side written
to allow that...
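Just to make the consumer side concrete, a receive loop that copes with
both cases could look roughly like the sketch below. MSG_SOCK_DEVMEM,
the SO_DEVMEM_* cmsg types and struct cmsg_devmem are placeholder names
and values for whatever the series settles on, not an agreed-upon uapi:

/* Sketch only: flag, cmsg type and struct names/values are placeholders. */
#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_SOCK_DEVMEM
#define MSG_SOCK_DEVMEM		0x2000000	/* placeholder value */
#endif
#ifndef SO_DEVMEM_LINEAR
#define SO_DEVMEM_LINEAR	98		/* placeholder value */
#define SO_DEVMEM_DMABUF	99		/* placeholder value */
#endif

struct cmsg_devmem {
	uint32_t frag_offset;	/* offset into the bound dma-buf */
	uint32_t frag_size;
	uint32_t frag_token;	/* handed back later to free the frag */
};

static void rx_one(int fd)
{
	char data[4096];
	char ctrl[CMSG_SPACE(sizeof(struct cmsg_devmem)) * 16];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = ctrl, .msg_controllen = sizeof(ctrl),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_SOCK_DEVMEM) < 0)
		return;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET)
			continue;
		if (cm->cmsg_type == SO_DEVMEM_LINEAR) {
			/* Host-memory chunk, already copied into data[]. */
		} else if (cm->cmsg_type == SO_DEVMEM_DMABUF) {
			struct cmsg_devmem *dm =
				(struct cmsg_devmem *)CMSG_DATA(cm);
			/* Device-memory frag: touch it only through the
			 * dma-buf at dm->frag_offset, then return
			 * dm->frag_token to the kernel so the buffer
			 * can be reused.
			 */
			(void)dm;
		}
	}
}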
On Sun, 5 Nov 2023 18:44:02 -0800 Mina Almasry wrote:
> + -
> + name: queues
> + doc: receive queues to bind the dma-buf to.
> + type: u32
> + multi-attr: true
I think that you should throw in the queue type.
I know you made the op imply RX:
> + -
> + name: bind-rx
but if we decide to create a separate "type" for some magic
queue type one day, we'll have to ponder how to extend this API.
IMHO a queue should always be identified by a <type, id> tuple.
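Something like the below, purely to illustrate the tuple idea (these
attribute/enum names are invented here, not taken from the posted spec):

/* Illustration only -- names are made up. */
enum netdev_queue_type {
	NETDEV_QUEUE_TYPE_RX,
	NETDEV_QUEUE_TYPE_TX,
};

enum {
	NETDEV_A_QUEUE_TYPE = 1,	/* u32, enum netdev_queue_type */
	NETDEV_A_QUEUE_ID,		/* u32, index within that type */

	__NETDEV_A_QUEUE_MAX,
	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
};

bind-rx (or any future op) would then take a multi-attr nested "queues"
attribute of that set, and adding a new queue type later is just a new
enum value.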
On Sun, 5 Nov 2023 18:44:05 -0800 Mina Almasry wrote:
> +static int mp_dmabuf_devmem_init(struct page_pool *pool)
> +{
> + struct netdev_dmabuf_binding *binding = pool->mp_priv;
> +
> + if (!binding)
> + return -EINVAL;
> +
> + if (pool->p.flags & PP_FLAG_DMA_MAP ||
> + pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> + return -EOPNOTSUPP;
This looks backwards; we should _force_ the driver to use the DMA
mapping built into the page pool APIs, to isolate the driver from
how the DMA address actually gets obtained. Right?
Maybe seeing the driver patches would illuminate things.
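IOW, keeping the field and flag names from the patch, I'd expect
something closer to:

static int mp_dmabuf_devmem_init(struct page_pool *pool)
{
	struct netdev_dmabuf_binding *binding = pool->mp_priv;

	if (!binding)
		return -EINVAL;

	/* Require the pool-internal DMA mapping so the driver never
	 * learns where the DMA address comes from.
	 */
	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
		return -EOPNOTSUPP;

	/* DMA_SYNC_DEV stays rejected, as in the patch. */
	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
		return -EOPNOTSUPP;

	/* ... rest of the init as in the patch ... */
	return 0;
}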
On Thu, 09 Nov 2023 12:05:37 +0100 Paolo Abeni wrote:
> > I suppose we just disagree on the elegance of the API.
>
> FWIW, I think sockopt +cmsg is the right API.
FWIW it's fine by me as well.
My brain is slightly fried after trying to catch up on the thread
for close to 2h. So forgive me if I'm missing something.
This applies to all emails I'm about to send :)
On Sun, 5 Nov 2023 18:44:11 -0800 Mina Almasry wrote:
> + trigger_device_reset();
User space must not be responsible for the reset.
We can add a temporary "recreate page pools" ndo until the queue API
is ready, but it should not be visible to the user in any way.
The kernel can then issue the same reset when the netlink socket dies,
to flush the device free lists.
Maybe we should also add an "allow device/all-queues reload" flag
to the netlink API to differentiate drivers which can't implement
the full queue API later on. We want to make sure the defaults work well
in our "target design", rather than at the first stage, and the target
design will reload queues one by one.
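To be explicit about what I mean, something along these lines; nothing
below exists today, the names are just for illustration:

/* Sketch only -- ndo_recreate_page_pools and netdev_devmem_unbind are
 * invented names, not existing kernel interfaces.
 */
#include <linux/netdevice.h>

struct netdev_dmabuf_binding;

struct netdev_devmem_ops {
	/* Tear down and recreate the device's page pools so queues
	 * bound to (or unbound from) a dma-buf pick up the change.
	 * Called only from the core, never triggered by user space.
	 */
	int (*ndo_recreate_page_pools)(struct net_device *dev,
				       struct netdev_dmabuf_binding *binding);
};

/* Core-side cleanup when the binding netlink socket dies: flush the
 * device free lists by recreating the pools, no user-visible reset.
 */
static void netdev_devmem_unbind(struct net_device *dev,
				 const struct netdev_devmem_ops *ops,
				 struct netdev_dmabuf_binding *binding)
{
	if (ops && ops->ndo_recreate_page_pools)
		ops->ndo_recreate_page_pools(dev, binding);
}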
On 09.11.23 04:20, Mina Almasry wrote:
> [SNIP]
>> But we might be able to do something as folio is doing now, mm subsystem
>> is still seeing 'struct folio/page', but other subsystem like slab is using
>> 'struct slab', and there is still some common fields shared between
>> 'struct folio' and 'struct slab'.
>>
> In my eyes this is almost exactly what I suggested in RFC v1 and got
> immediately nacked with no room to negotiate. What we did for v1 is to
> allocate struct pages for dma-buf to make dma-bufs look like struct
> page to mm subsystem. Almost exactly what you're describing above.
> It's a no-go. I don't think renaming struct page to netmem is going to
> move the needle (it also re-introduces code-churn). What I feel like I
> learnt is that dma-bufs are not struct pages and can't be made to look
> like one, I think.
Yeah, that pretty much hits the nail on the head. One thing that hasn't
been mentioned yet, and that you could potentially try, is to go the
other way around.
In other words instead of importing a DMA-buf file descriptor into the
page-pool, you take some netdev page-pool pages and export them as
DMA-buf handle.
All you then need to do is implement the necessary DMA-buf interface,
e.g. wait for DMA-fences before accessing the memory, and when you have
an async operation in flight, install a DMA-fence so that others can
wait for your operation to finish, etc.
This still doesn't give you peer-to-peer over, for example, the PCIe
bus, but it would give you zero copy in the sense that a GPU or other
acceleration device writes directly into memory a network device can
work with.
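For reference, the exporter side mostly boils down to filling in a
dma_buf_ops and calling dma_buf_export(); a rough sketch, with the
pp_dmabuf_* names invented for the example and all attach/fence
handling left out:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/fcntl.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

struct pp_dmabuf {
	struct page **pages;		/* pages borrowed from the page pool */
	unsigned int nr_pages;
};

static struct sg_table *pp_dmabuf_map(struct dma_buf_attachment *attach,
				      enum dma_data_direction dir)
{
	struct pp_dmabuf *priv = attach->dmabuf->priv;
	struct sg_table *sgt;

	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
	if (!sgt)
		return ERR_PTR(-ENOMEM);

	if (sg_alloc_table_from_pages(sgt, priv->pages, priv->nr_pages, 0,
				      (unsigned long)priv->nr_pages << PAGE_SHIFT,
				      GFP_KERNEL)) {
		kfree(sgt);
		return ERR_PTR(-ENOMEM);
	}

	if (dma_map_sgtable(attach->dev, sgt, dir, 0)) {
		sg_free_table(sgt);
		kfree(sgt);
		return ERR_PTR(-ENOMEM);
	}

	return sgt;
}

static void pp_dmabuf_unmap(struct dma_buf_attachment *attach,
			    struct sg_table *sgt,
			    enum dma_data_direction dir)
{
	dma_unmap_sgtable(attach->dev, sgt, dir, 0);
	sg_free_table(sgt);
	kfree(sgt);
}

static void pp_dmabuf_release(struct dma_buf *dmabuf)
{
	/* Pages go back to the page pool elsewhere; only free bookkeeping. */
	kfree(dmabuf->priv);
}

static const struct dma_buf_ops pp_dmabuf_ops = {
	.map_dma_buf	= pp_dmabuf_map,
	.unmap_dma_buf	= pp_dmabuf_unmap,
	.release	= pp_dmabuf_release,
};

static struct dma_buf *pp_export_dmabuf(struct pp_dmabuf *priv)
{
	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);

	exp_info.ops   = &pp_dmabuf_ops;
	exp_info.size  = (size_t)priv->nr_pages << PAGE_SHIFT;
	exp_info.flags = O_RDWR;
	exp_info.priv  = priv;

	return dma_buf_export(&exp_info);
}

The real work is then the fencing described above (dma_resv/dma_fence
handling around buffer reuse), which this skeleton deliberately omits.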
We already have some similar functionality for at least Intel and AMD hw,
where a userptr mechanism is used to make malloc'ed memory (backed by
normal struct pages) available to the GPU. The problem with this
approach is that most GPUs currently can't do good recoverable page
faults, which in turn makes the whole MMU-notifier-based approach
horribly inefficient.
Just giving a few more pointers you might want to look into...
Cheers,
Christian.
From: Mina Almasry
> Sent: 06 November 2023 02:44
>
> For device memory TCP, we expect the skb headers to be available in host
> memory for access, and we expect the skb frags to be in device memory
> and inaccessible to the host. We expect there to be no mixing and
> matching of device memory frags (inaccessible) with host memory frags
> (accessible) in the same skb.
>
> Add a skb->devmem flag which indicates whether the frags in this skb
> are device memory frags or not.
>
...
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 1fae276c1353..8fb468ff8115 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> * @csum_level: indicates the number of consecutive checksums found in
> * the packet minus one that have been verified as
> * CHECKSUM_UNNECESSARY (max 3)
> + * @devmem: indicates that all the fragments in this skb are backed by
> + * device memory.
> * @dst_pending_confirm: need to confirm neighbour
> * @decrypted: Decrypted SKB
> * @slow_gro: state present at GRO time, slower prepare step required
> @@ -991,7 +993,7 @@ struct sk_buff {
> #if IS_ENABLED(CONFIG_IP_SCTP)
> __u8 csum_not_inet:1;
> #endif
> -
> + __u8 devmem:1;
> #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> __u16 tc_index; /* traffic control index */
> #endif
> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> __skb_zcopy_downgrade_managed(skb);
> }
Doesn't that bloat struct sk_buff?
I'm not sure there are any spare bits available.
Although CONFIG_NET_SWITCHDEV and CONFIG_NET_SCHED seem to
already add padding.
David