Currently, vmxnet3 uses the GRO callback only if LRO is disabled. However, on SmartNIC-based setups where UPT is supported, LRO can be enabled from the guest VM, but the UPT device does not support LRO as of now. In such cases, there can be performance degradation as GRO is not being done.
This patch fixes this issue by calling GRO API when UPT is enabled. We use updateRxProd to determine if UPT mode is active or not.
Cc: stable@vger.kernel.org
Fixes: 6f91f4ba046e ("vmxnet3: add support for capability registers")
Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
---
 drivers/net/vmxnet3/vmxnet3_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 682987040ea8..8f7ac7d85afc 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -1688,7 +1688,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 			if (unlikely(rcd->ts))
 				__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), rcd->tci);
 
-			if (adapter->netdev->features & NETIF_F_LRO)
+			/* Use GRO callback if UPT is enabled */
+			if ((adapter->netdev->features & NETIF_F_LRO) && !rq->shared->updateRxProd)
 				netif_receive_skb(skb);
 			else
 				napi_gro_receive(&rq->napi, skb);
On 2023/3/9 6:25, Ronak Doshi wrote:
> Currently, vmxnet3 uses the GRO callback only if LRO is disabled. However, on SmartNIC-based setups where UPT is supported, LRO can be enabled from the guest VM, but the UPT device does not support LRO as of now. In such cases, there can be performance degradation as GRO is not being done.
> This patch fixes this issue by calling GRO API when UPT is enabled. We use updateRxProd to determine if UPT mode is active or not.
> Cc: stable@vger.kernel.org
> Fixes: 6f91f4ba046e ("vmxnet3: add support for capability registers")
> Signed-off-by: Ronak Doshi <doshir@vmware.com>
> Acked-by: Guolin Yang <gyang@vmware.com>
> ---
>  drivers/net/vmxnet3/vmxnet3_drv.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
> index 682987040ea8..8f7ac7d85afc 100644
> --- a/drivers/net/vmxnet3/vmxnet3_drv.c
> +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
> @@ -1688,7 +1688,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
>  			if (unlikely(rcd->ts))
>  				__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), rcd->tci);
>
> -			if (adapter->netdev->features & NETIF_F_LRO)
> +			/* Use GRO callback if UPT is enabled */
> +			if ((adapter->netdev->features & NETIF_F_LRO) && !rq->shared->updateRxProd)
If the UPT device does not support LRO, why not just clear NETIF_F_LRO from adapter->netdev->features?
With the above change, it seems that LRO is supported from the user's POV, but GRO is actually being done.
Also, if NETIF_F_LRO is set, do we need to clear the NETIF_F_GRO bit, so that there is no confusion for the user?
> 				netif_receive_skb(skb);
> 			else
> 				napi_gro_receive(&rq->napi, skb);
On 3/8/23, 4:34 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> > -			if (adapter->netdev->features & NETIF_F_LRO)
> > +			/* Use GRO callback if UPT is enabled */
> > +			if ((adapter->netdev->features & NETIF_F_LRO) && !rq->shared->updateRxProd)
> If the UPT device does not support LRO, why not just clear NETIF_F_LRO from adapter->netdev->features?
> With the above change, it seems that LRO is supported from the user's POV, but GRO is actually being done.
> Also, if NETIF_F_LRO is set, do we need to clear the NETIF_F_GRO bit, so that there is no confusion for the user?
We cannot clear the LRO bit, as the virtual NIC can run in either emulation or UPT mode. When the vNIC switches between UPT and emulation mode, the guest VM is not notified. Hence, we use updateRxProd, which is shared in the datapath, to check which mode is active.
Also, we plan to add an event to notify the guest about this, but that is for a separate patch and may take some time.
Thanks, Ronak
On 2023/3/10 6:50, Ronak Doshi wrote:
> On 3/8/23, 4:34 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> > > -			if (adapter->netdev->features & NETIF_F_LRO)
> > > +			/* Use GRO callback if UPT is enabled */
> > > +			if ((adapter->netdev->features & NETIF_F_LRO) && !rq->shared->updateRxProd)
> > If the UPT device does not support LRO, why not just clear NETIF_F_LRO from adapter->netdev->features?
> > With the above change, it seems that LRO is supported from the user's POV, but GRO is actually being done.
> > Also, if NETIF_F_LRO is set, do we need to clear the NETIF_F_GRO bit, so that there is no confusion for the user?
> We cannot clear the LRO bit, as the virtual NIC can run in either emulation or UPT mode. When the vNIC switches between UPT and emulation mode, the guest VM is not notified. Hence, we use updateRxProd, which is shared in the datapath, to check which mode is active.
So it is a runtime thing? What happens if some LRO'ed packet is put in the rx queue and the vNIC then switches to UPT mode? Is it OK for those LRO'ed packets to go through the software GSO processing? If yes, why not just call napi_gro_receive() for the LRO case too?
Looking closer, it seems the vNIC is implementing hw GRO from the driver's view, as the driver is setting skb_shinfo(skb)->gso_* accordingly:
https://elixir.bootlin.com/linux/latest/source/drivers/net/vmxnet3/vmxnet3_d...
In that case, you may call napi_gro_receive() for those GRO'ed skb too, see:
https://lore.kernel.org/netdev/166479721495.20474.5436625882203781290.git-pa...
> Also, we plan to add an event to notify the guest about this, but that is for a separate patch and may take some time.
> Thanks, Ronak
On 3/9/23, 5:02 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> So it is a runtime thing? What happens if some LRO'ed packet is put in the rx queue and the vNIC then switches to UPT mode? Is it OK for those LRO'ed packets to go through the software GSO processing?
Yes, it should be fine.
> If yes, why not just call napi_gro_receive() for the LRO case too?
We did perf measurements in the past and it turned out that this results in a perf penalty. See https://patchwork.ozlabs.org/project/netdev/patch/1308947605-4300-1-git-send...
In fact, we recently did some perf measurements internally on RHEL 9.0, and they still showed some penalty.
> Looking closer, it seems the vNIC is implementing hw GRO from the driver's view, as the driver is setting skb_shinfo(skb)->gso_* accordingly:
> https://elixir.bootlin.com/linux/latest/source/drivers/net/vmxnet3/vmxnet3_drv.c#L1665
> In that case, you may call napi_gro_receive() for those GRO'ed skb too, see:
I see. It seems this was added recently. It will need re-evaluation by the team based on ToT Linux. That can be done in the near future, but as it might take time, this patch should be applied for now, since the UPT patches are already upstreamed.
Thanks, Ronak
On 2023/3/15 5:09, Ronak Doshi wrote:
> On 3/9/23, 5:02 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> > So it is a runtime thing? What happens if some LRO'ed packet is put in the rx queue and the vNIC then switches to UPT mode? Is it OK for those LRO'ed packets to go through the software GSO processing?
> Yes, it should be fine.
> > If yes, why not just call napi_gro_receive() for the LRO case too?
> We did perf measurements in the past and it turned out that this results in a perf penalty. See https://patchwork.ozlabs.org/project/netdev/patch/1308947605-4300-1-git-send...
> In fact, we recently did some perf measurements internally on RHEL 9.0, and they still showed some penalty.
Does clearing NETIF_F_GRO in netdev->features bring back the performance? If not, maybe there is something that needs investigating.
> > Looking closer, it seems the vNIC is implementing hw GRO from the driver's view, as the driver is setting skb_shinfo(skb)->gso_* accordingly:
> > https://elixir.bootlin.com/linux/latest/source/drivers/net/vmxnet3/vmxnet3_drv.c#L1665
> > In that case, you may call napi_gro_receive() for those GRO'ed skb too, see:
> I see. It seems this was added recently. It will need re-evaluation by the team based on ToT Linux. That can be done in the near future, but as it might take time, this patch should be applied for now, since the UPT patches are already upstreamed.
Checking rq->shared->updateRxProd in the driver to decide whether GRO is allowed does not seem right to me, as the netstack already checks NETIF_F_GRO in netif_elide_gro().
Does clearing NETIF_F_GRO in netdev->features during the driver init process work for your case?
netdev->hw_features is for the driver to advertise the hw's capabilities; the driver can enable or disable a specific capability by setting netdev->features during the driver init process, and the user can enable or disable it later using ethtool if needed.
> Thanks, Ronak
On 3/14/23, 6:52 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> Does clearing NETIF_F_GRO in netdev->features bring back the performance? If not, maybe there is something that needs investigating.
Yes, it does. Simply using netif_receive_skb works fine.
> Checking rq->shared->updateRxProd in the driver to decide whether GRO is allowed does not seem right to me, as the netstack already checks NETIF_F_GRO in netif_elide_gro().
updateRxProd is NOT being used to determine if GRO is allowed. It is being used to indicate that UPT is active, so the driver should just use the GRO callback. This is as good as having only the GRO callback for the UPT driver, which you were suggesting earlier.
> Does clearing NETIF_F_GRO in netdev->features during the driver init process work for your case?
No, this does not work, as UPT mode can be enabled/disabled at runtime without the guest being informed. This is a para-virtualized driver and does not know if the guest is being run in emulation or UPT mode.
> netdev->hw_features is for the driver to advertise the hw's capabilities; the driver can enable or disable a specific capability by setting netdev->features during the driver init process, and the user can enable or disable it later using ethtool if needed.
As I mentioned above, the guest is not informed at runtime about UPT status. So, we need this mechanism to avoid the performance penalty.
Thanks, Ronak
On 2023/3/15 10:27, Ronak Doshi wrote:
> On 3/14/23, 6:52 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> > Does clearing NETIF_F_GRO in netdev->features bring back the performance? If not, maybe there is something that needs investigating.
> Yes, it does. Simply using netif_receive_skb works fine.
> > Checking rq->shared->updateRxProd in the driver to decide whether GRO is allowed does not seem right to me, as the netstack already checks NETIF_F_GRO in netif_elide_gro().
> updateRxProd is NOT being used to determine if GRO is allowed. It is being used to indicate that UPT is active, so the driver should just use the GRO callback. This is as good as having only the GRO callback for the UPT driver, which you were suggesting earlier.
> > Does clearing NETIF_F_GRO in netdev->features during the driver init process work for your case?
> No, this does not work, as UPT mode can be enabled/disabled at runtime without the guest being informed. This is a para-virtualized driver and does not know if the guest is being run in emulation or UPT mode.
I think checking updateRxProd means, in some way, that the para-virtualized driver above needs to know whether the guest is being run in emulation or UPT mode.
I am not sure yet how we can handle hw capabilities changing at runtime; that is why I suggested setting the hw capability during the driver init process, so that the user can then enable or disable GRO if needed.
Suppose the user enables software GRO using ethtool; disabling GRO through some runtime check seems to go against the will of the user.
Also, if you are able to "add an event to notify the guest about this", I suppose the para-virtualized driver will clear the specific bit in netdev->hw_features and netdev->features when handling the event? Does the user need to be notified about this, and will the user be confused by this change without a notification?
IMHO, being a para-virtualized driver does not make any difference; users do not care whether they are configuring a netdev behind a para-virtualized driver or not.
> > netdev->hw_features is for the driver to advertise the hw's capabilities; the driver can enable or disable a specific capability by setting netdev->features during the driver init process, and the user can enable or disable it later using ethtool if needed.
> As I mentioned above, the guest is not informed at runtime about UPT status. So, we need this mechanism to avoid the performance penalty.
> Thanks, Ronak
On 3/14/23, 8:05 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> I am not sure yet how we can handle hw capabilities changing at runtime; that is why I suggested setting the hw capability during the driver init process, so that the user can then enable or disable GRO if needed.
It is not about enabling or disabling LRO/GRO. It is about which callback is used to deliver the packets to the stack.
During init, the vNIC will always come up in emulation (non-UPT) mode and the user can request whichever feature they want (LRO or GRO or both). If it is in UPT mode, as we know the UPT device does not support LRO, we use the GRO API to deliver. If GRO is disabled by the user, then it can still take the normal path. If in emulation (non-UPT) mode, ESXi will perform LRO.
> Suppose the user enables software GRO using ethtool; disabling GRO through some runtime check seems to go against the will of the user.
We are not disabling GRO here; either we perform LRO on ESXi or GRO in the guest stack.
> Also, if you are able to "add an event to notify the guest about this", I suppose the para-virtualized driver will clear the specific bit in netdev->hw_features and netdev->features when handling the event? Does the user need to be notified about this, and will the user be confused by this change without a notification?
We won’t be changing any feature bits. It is just to let the driver know that UPT is active and it should use the GRO path instead of relying on ESXi LRO.
Thanks, Ronak
On 2023/3/16 7:44, Ronak Doshi wrote:
> On 3/14/23, 8:05 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> > I am not sure yet how we can handle hw capabilities changing at runtime; that is why I suggested setting the hw capability during the driver init process, so that the user can then enable or disable GRO if needed.
> It is not about enabling or disabling LRO/GRO. It is about which callback is used to deliver the packets to the stack.
That's the point I am trying to make. If I understand it correctly, you cannot change the callback from napi_gro_receive() to netif_receive_skb() when netdev->features has the NETIF_F_GRO bit set.
The NETIF_F_GRO bit in netdev->features tells the user that the netstack will perform software GRO processing if the packet can be GRO'ed.
Calling netif_receive_skb() with the NETIF_F_GRO bit set in netdev->features will cause confusion for the user, IMHO.
> During init, the vNIC will always come up in emulation (non-UPT) mode and the user can request whichever feature they want (LRO or GRO or both). If it is in UPT mode, as we know the UPT device does not support LRO, we use the GRO API to deliver. If GRO is disabled by the user, then it can still take the normal path. If in emulation (non-UPT) mode, ESXi will perform LRO.
> > Suppose the user enables software GRO using ethtool; disabling GRO through some runtime check seems to go against the will of the user.
> We are not disabling GRO here; either we perform LRO on ESXi or GRO in the guest stack.
I mean software GRO performed by the netstack. There are the NETIF_F_GRO_HW and NETIF_F_LRO bits for GRO and LRO performed by hw. LRO on ESXi is like a hw offload in the eyes of the driver in the guest, even if it is processed by some software in ESXi.
> > Also, if you are able to "add an event to notify the guest about this", I suppose the para-virtualized driver will clear the specific bit in netdev->hw_features and netdev->features when handling the event? Does the user need to be notified about this, and will the user be confused by this change without a notification?
> We won’t be changing any feature bits. It is just to let the driver know that UPT is active and it should use the GRO path instead of relying on ESXi LRO.
As above, there are different feature bits for that: NETIF_F_LRO, NETIF_F_GRO and NETIF_F_GRO_HW. IMHO, deciding which callback to use based on some driver configuration, without cooperating with the above feature bits, does not seem right to me.
> Thanks, Ronak
On 3/15/23, 6:47 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> That's the point I am trying to make. If I understand it correctly, you cannot change the callback from napi_gro_receive() to netif_receive_skb() when netdev->features has the NETIF_F_GRO bit set.
Where are we doing this? Our preference is to use netif_receive_skb() only when LRO is enabled. If both LRO and GRO are enabled on the vnic, which API should be used?
> The NETIF_F_GRO bit in netdev->features tells the user that the netstack will perform software GRO processing if the packet can be GRO'ed.
Even if the packet is already LRO'ed?
> Calling netif_receive_skb() with the NETIF_F_GRO bit set in netdev->features will cause confusion for the user, IMHO.
As long as LRO is enabled and performed by ESXi (which it will do), I don’t think the user cares about GRO. Even if we use napi_gro_receive() in such a case, it degrades performance, as unnecessary cycles are spent on an already LRO'ed packet.
> As above, there are different feature bits for that: NETIF_F_LRO, NETIF_F_GRO and NETIF_F_GRO_HW. IMHO, deciding which callback to use based on some driver configuration, without cooperating with the above feature bits, does not seem right to me.
We are not neglecting feature bits. We just know that in UPT mode LRO won’t be done, so we use the napi_gro_receive() callback by default.
Thanks, Ronak
On Thu, 16 Mar 2023 04:03:52 +0000 Ronak Doshi wrote:
> > Calling netif_receive_skb() with the NETIF_F_GRO bit set in netdev->features will cause confusion for the user, IMHO.
> As long as LRO is enabled and performed by ESXi (which it will do), I don’t think the user cares about GRO. Even if we use napi_gro_receive() in such a case, it degrades performance, as unnecessary cycles are spent on an already LRO'ed packet.
Can you provide some numbers to illustrate what the slow down is?
On 3/15/23, 9:13 PM, "Jakub Kicinski" <kuba@kernel.org> wrote:
> Can you provide some numbers to illustrate what the slow down is?
Below are some sample test numbers collected by our perf team.
Test                                  socket & msg size             base         using only gro
1VM 14vcpu UDP stream receive         256K 256 bytes (packets/sec)  217.01 Kps   187.98 Kps   -13.37%
16VM 2vcpu TCP stream send Thpt       8K 256 bytes (Gbps)           18.00 Gbps   17.02 Gbps    -5.44%
1VM 14vcpu ResponseTimeMean Receive   (in micro secs)               163 us       170 us        -4.29%
A similar test was also done in the past. See https://patchwork.ozlabs.org/project/netdev/patch/1308947605-4300-1-git-send...
Unfortunately, there are no stats present in that discussion.
Thanks, Ronak
On Thu, 16 Mar 2023 05:21:42 +0000 Ronak Doshi wrote:
> Below are some sample test numbers collected by our perf team.
> Test                                  socket & msg size             base         using only gro
> 1VM 14vcpu UDP stream receive         256K 256 bytes (packets/sec)  217.01 Kps   187.98 Kps   -13.37%
> 16VM 2vcpu TCP stream send Thpt       8K 256 bytes (Gbps)           18.00 Gbps   17.02 Gbps    -5.44%
> 1VM 14vcpu ResponseTimeMean Receive   (in micro secs)               163 us       170 us        -4.29%
A bit more than I suspected, thanks for the data.
On 2023/3/17 4:34, Jakub Kicinski wrote:
> On Thu, 16 Mar 2023 05:21:42 +0000 Ronak Doshi wrote:
> > Below are some sample test numbers collected by our perf team.
> > Test                                  socket & msg size             base         using only gro
> > 1VM 14vcpu UDP stream receive         256K 256 bytes (packets/sec)  217.01 Kps   187.98 Kps   -13.37%
> > 16VM 2vcpu TCP stream send Thpt       8K 256 bytes (Gbps)           18.00 Gbps   17.02 Gbps    -5.44%
> > 1VM 14vcpu ResponseTimeMean Receive   (in micro secs)               163 us       170 us        -4.29%
> A bit more than I suspected, thanks for the data.
Maybe we should first do some investigation to find out why the performance loss is more than suspected.
For example, is an LRO'ed skb added to gro_list->list, and then a new LRO'ed skb from the same flow goes through the whole GSO processing, only to find that we have to flush the old LRO'ed skb from gro_list->list and add the new one to gro_list->list again?
On 3/16/23, 7:37 PM, "Yunsheng Lin" <linyunsheng@huawei.com> wrote:
> On 2023/3/17 4:34, Jakub Kicinski wrote:
> > On Thu, 16 Mar 2023 05:21:42 +0000 Ronak Doshi wrote:
> > > Below are some sample test numbers collected by our perf team.
> > > Test                                  socket & msg size             base         using only gro
> > > 1VM 14vcpu UDP stream receive         256K 256 bytes (packets/sec)  217.01 Kps   187.98 Kps   -13.37%
> > > 16VM 2vcpu TCP stream send Thpt       8K 256 bytes (Gbps)           18.00 Gbps   17.02 Gbps    -5.44%
> > > 1VM 14vcpu ResponseTimeMean Receive   (in micro secs)               163 us       170 us        -4.29%
> > A bit more than I suspected, thanks for the data.
> Maybe we should first do some investigation to find out why the performance loss is more than suspected.
I don’t think holding this patch to investigate why it takes longer in GRO is worthwhile. That is a separate issue. UPT patches are already upstreamed to Linux and cross-ported to relevant distros for customers to use. We need to apply this patch to avoid the performance degradation in UPT mode as LRO is not available on UPT device.
I don’t see a functional issue with this patch. In UPT as LRO is not available, it needs to use GRO.
Thanks, Ronak
On Fri, 17 Mar 2023 20:27:50 +0000 Ronak Doshi wrote:
> I don’t think holding this patch to investigate why it takes longer in GRO is worthwhile. That is a separate issue. UPT patches are already upstreamed to Linux and cross-ported to relevant distros for customers to use. We need to apply this patch to avoid the performance degradation in UPT mode as LRO is not available on UPT device.
> I don’t see a functional issue with this patch. In UPT as LRO is not available, it needs to use GRO.
Fine by me, FWIW, but please respin the patch and feed some of the discussion into the commit message.