From: Bailey Forrest bcf@google.com
The NIC requires each TSO segment to not span more than 10 descriptors. gve_can_send_tso() performs this check. However, the current check misses an edge case: when a TSO skb has a large frag that must be split into multiple descriptors because each descriptor may not exceed 16KB (GVE_TX_MAX_BUF_SIZE_DQO), the 10-descriptor limit per TSO segment can be exceeded. This change fixes the edge case.
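To make the edge case concrete (the sizes below are illustrative, not from a specific reproducer): assume gso_size = 9000 and GVE_TX_MAX_BUF_SIZE_DQO = 16KB, with no payload in the linear area, a first frag of 16385 bytes, and nine 100-byte frags after it. The second TSO segment starts at offset 9000 of the first frag, so its leading 7385 bytes straddle that frag's 16384-byte split point and take two descriptors; with one more descriptor per small frag the segment needs 11 descriptors in total. The current check counts the leading piece as a single buffer, arrives at 10, and wrongly lets the skb through.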
Fixes: a57e5de476be ("gve: DQO: Add TX path")
Signed-off-by: Praveen Kaligineedi pkaligineedi@google.com
Signed-off-by: Bailey Forrest bcf@google.com
Reviewed-by: Jeroen de Borst jeroendb@google.com
---
 drivers/net/ethernet/google/gve/gve_tx_dqo.c | 22 +++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/google/gve/gve_tx_dqo.c b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
index 0b3cca3fc792..dc39dc481f21 100644
--- a/drivers/net/ethernet/google/gve/gve_tx_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
@@ -866,22 +866,42 @@ static bool gve_can_send_tso(const struct sk_buff *skb)
 	const int header_len = skb_tcp_all_headers(skb);
 	const int gso_size = shinfo->gso_size;
 	int cur_seg_num_bufs;
+	int last_frag_size;
 	int cur_seg_size;
 	int i;

 	cur_seg_size = skb_headlen(skb) - header_len;
+	last_frag_size = skb_headlen(skb);
 	cur_seg_num_bufs = cur_seg_size > 0;

 	for (i = 0; i < shinfo->nr_frags; i++) {
 		if (cur_seg_size >= gso_size) {
 			cur_seg_size %= gso_size;
 			cur_seg_num_bufs = cur_seg_size > 0;
+
+			/* If the last buffer is split in the middle of a TSO
+			 * segment, then it will count as two descriptors.
+			 */
+			if (last_frag_size > GVE_TX_MAX_BUF_SIZE_DQO) {
+				int last_frag_remain = last_frag_size %
+					GVE_TX_MAX_BUF_SIZE_DQO;
+
+				/* If the last frag was evenly divisible by
+				 * GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
+				 * split in the current segment.
+				 */
+				if (last_frag_remain &&
+				    cur_seg_size > last_frag_remain) {
+					cur_seg_num_bufs++;
+				}
+			}
 		}

 		if (unlikely(++cur_seg_num_bufs > max_bufs_per_seg))
 			return false;

-		cur_seg_size += skb_frag_size(&shinfo->frags[i]);
+		last_frag_size = skb_frag_size(&shinfo->frags[i]);
+		cur_seg_size += last_frag_size;
 	}

 	return true;
Hi,
Thanks for your patch.
FYI: kernel test robot notices the stable kernel rule is not satisfied.
The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
Rule: add the tag "Cc: stable@vger.kernel.org" in the sign-off area to have the patch automatically included in the stable tree.
Subject: [PATCH net] gve: Fix an edge case for TSO skb validity check
Link: https://lore.kernel.org/stable/20240718190221.2219835-1-pkaligineedi%40googl...
Praveen Kaligineedi wrote:
From: Bailey Forrest bcf@google.com
The NIC requires each TSO segment to not span more than 10 descriptors. gve_can_send_tso() performs this check. However, the current check misses an edge case: when a TSO skb has a large frag that must be split into multiple descriptors because each descriptor may not exceed 16KB (GVE_TX_MAX_BUF_SIZE_DQO), the 10-descriptor limit per TSO segment can be exceeded. This change fixes the edge case.
Fixes: a57e5de476be ("gve: DQO: Add TX path")
Signed-off-by: Praveen Kaligineedi pkaligineedi@google.com
Signed-off-by: Bailey Forrest bcf@google.com
Reviewed-by: Jeroen de Borst jeroendb@google.com
---
 drivers/net/ethernet/google/gve/gve_tx_dqo.c | 22 +++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/google/gve/gve_tx_dqo.c b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
index 0b3cca3fc792..dc39dc481f21 100644
--- a/drivers/net/ethernet/google/gve/gve_tx_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_tx_dqo.c
@@ -866,22 +866,42 @@ static bool gve_can_send_tso(const struct sk_buff *skb)
 	const int header_len = skb_tcp_all_headers(skb);
 	const int gso_size = shinfo->gso_size;
 	int cur_seg_num_bufs;
+	int last_frag_size;
nit: last_frag can be interpreted as frags[nr_frags - 1], perhaps prev_frag.
 	int cur_seg_size;
 	int i;

 	cur_seg_size = skb_headlen(skb) - header_len;
+	last_frag_size = skb_headlen(skb);
 	cur_seg_num_bufs = cur_seg_size > 0;
 	for (i = 0; i < shinfo->nr_frags; i++) {
 		if (cur_seg_size >= gso_size) {
 			cur_seg_size %= gso_size;
 			cur_seg_num_bufs = cur_seg_size > 0;
+
+			/* If the last buffer is split in the middle of a TSO
s/buffer/frag?
+			 * segment, then it will count as two descriptors.
+			 */
+			if (last_frag_size > GVE_TX_MAX_BUF_SIZE_DQO) {
+				int last_frag_remain = last_frag_size %
+					GVE_TX_MAX_BUF_SIZE_DQO;
+
+				/* If the last frag was evenly divisible by
+				 * GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
+				 * split in the current segment.
Is this true even if the segment did not start at the start of the frag?
Overall, it's not trivial to follow. Probably because the goal is to count max descriptors per segment, but that is not what is being looped over.
Intuitively (perhaps buggy, a quick sketch), this is what is intended, right?
static bool gve_can_send_tso(const struct sk_buff *skb)
{
	int frag_size = skb_headlen(skb) - header_len;
	int gso_size_left;
	int frag_idx = 0;
	int num_desc;
	int desc_len;
	int off = 0;

	while (off < skb->len) {
		gso_size_left = shinfo->gso_size;
		num_desc = 0;

		while (gso_size_left) {
			desc_len = min(gso_size_left, frag_size);
			gso_size_left -= desc_len;
			frag_size -= desc_len;
			num_desc++;

			if (num_desc > max_descs_per_seg)
				return false;

			if (!frag_size)
				frag_size = skb_frag_size(&shinfo->frags[frag_idx++]);
		}
	}

	return true;
}
This however loops skb->len / gso_size times, while the above modulo operation skips over the many segments that lie entirely within a single frag. Not sure if the more intuitive approach could be as performant.
Else, I'll stare some more at the suggested patch to convince myself that it is correct and complete..
+				 */
+				if (last_frag_remain &&
+				    cur_seg_size > last_frag_remain) {
+					cur_seg_num_bufs++;
+				}
+			}
 		}

 		if (unlikely(++cur_seg_num_bufs > max_bufs_per_seg))
 			return false;

-		cur_seg_size += skb_frag_size(&shinfo->frags[i]);
+		last_frag_size = skb_frag_size(&shinfo->frags[i]);
+		cur_seg_size += last_frag_size;
 	}

 	return true;
--
2.45.2.993.g49e7a77208-goog
On Thu, Jul 18, 2024 at 4:07 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
This however loops skb->len / gso_size times, while the above modulo operation skips over the many segments that lie entirely within a single frag. Not sure if the more intuitive approach could be as performant.
Yes, the original intention of the code was to loop over nr_frags, instead of (skb->len / gso_size).
But perhaps that's premature optimization if it makes the code significantly harder to follow.
On Thu, Jul 18, 2024 at 8:28 PM Bailey Forrest bcf@google.com wrote:
On Thu, Jul 18, 2024 at 4:07 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
This however loops skb->len / gso_size times, while the above modulo operation skips over the many segments that lie entirely within a single frag. Not sure if the more intuitive approach could be as performant.
Yes, the original intention of the code was to loop over nr_frags, instead of (skb->len / gso_size).
But perhaps that's premature optimization if it makes the code significantly harder to follow.
Thanks. I don't mean to ask for a wholesale rewrite if not needed.
But perhaps the logic can be explained in the commit in a way that it is more immediately obvious.
Praveen suggested that. I'll respond to his reply in more detail.
On Thu, Jul 18, 2024 at 4:07 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
* segment, then it will count as two descriptors.
*/
if (last_frag_size > GVE_TX_MAX_BUF_SIZE_DQO) {
int last_frag_remain = last_frag_size %
GVE_TX_MAX_BUF_SIZE_DQO;
/* If the last frag was evenly divisible by
* GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
* split in the current segment.
Is this true even if the segment did not start at the start of the frag?
The comment probably is a bit confusing here. The current segment we are tracking could have a portion in the previous frag. The code assumed that the portion on the previous frag (if present) mapped to only one descriptor. However, that portion could have been split across two descriptors due to the restriction that each descriptor cannot exceed 16KB. That's the case this fix is trying to address. I will work on simplifying the logic based on your suggestion below so that the fix is easier to follow.
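For instance (hypothetical sizes): a previous frag of 16385 bytes is posted as a 16384-byte buffer plus a 1-byte buffer. If the current segment's portion of that frag is, say, its last 7385 bytes, that portion covers the end of the first buffer and the whole second buffer, i.e. two descriptors where the code counted one.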
Overall, it's not trivial to follow. Probably because the goal is to count max descriptors per segment, but that is not what is being looped over.
Intuitively (perhaps buggy, a quick sketch), this is what is intended, right?
static bool gve_can_send_tso(const struct sk_buff *skb)
{
	int frag_size = skb_headlen(skb) - header_len;
	int gso_size_left;
	int frag_idx = 0;
	int num_desc;
	int desc_len;
	int off = 0;

	while (off < skb->len) {
		gso_size_left = shinfo->gso_size;
		num_desc = 0;

		while (gso_size_left) {
			desc_len = min(gso_size_left, frag_size);
			gso_size_left -= desc_len;
			frag_size -= desc_len;
			num_desc++;

			if (num_desc > max_descs_per_seg)
				return false;

			if (!frag_size)
				frag_size = skb_frag_size(&shinfo->frags[frag_idx++]);
		}
	}

	return true;
}
This however loops skb->len / gso_size times, while the above modulo operation skips over the many segments that lie entirely within a single frag. Not sure if the more intuitive approach could be as performant.
Else, I'll stare some more at the suggested patch to convince myself that it is correct and complete..
On Thu, Jul 18, 2024 at 9:52 PM Praveen Kaligineedi pkaligineedi@google.com wrote:
On Thu, Jul 18, 2024 at 4:07 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
* segment, then it will count as two descriptors.
*/
if (last_frag_size > GVE_TX_MAX_BUF_SIZE_DQO) {
int last_frag_remain = last_frag_size %
GVE_TX_MAX_BUF_SIZE_DQO;
/* If the last frag was evenly divisible by
* GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
* split in the current segment.
Is this true even if the segment did not start at the start of the frag?
The comment probably is a bit confusing here. The current segment we are tracking could have a portion in the previous frag. The code assumed that the portion on the previous frag (if present) mapped to only one descriptor. However, that portion could have been split across two descriptors due to the restriction that each descriptor cannot exceed 16KB.
/* If the last frag was evenly divisible by
* GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
* split in the current segment.
This is true because the smallest multiple of 16KB is 32KB, and the largest gso_size at least for Ethernet will be 9K. But I don't think that that is what is used here as the basis for this statement?
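Spelling out the arithmetic as I read it (the patch does not state this explicitly): if last_frag_size is a multiple of GVE_TX_MAX_BUF_SIZE_DQO, every piece of the frag is exactly 16384 bytes, while the segment leftover cur_seg_size is < gso_size <= ~9K < 16384, so the leftover always fits inside the frag's final piece and never straddles a split point.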
That's the case this fix is trying to address. I will work on simplifying the logic based on your suggestion below so that the fix is easier to follow
On Thu, Jul 18, 2024 at 8:47 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
On Thu, Jul 18, 2024 at 9:52 PM Praveen Kaligineedi pkaligineedi@google.com wrote:
On Thu, Jul 18, 2024 at 4:07 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
* segment, then it will count as two descriptors.
*/
if (last_frag_size > GVE_TX_MAX_BUF_SIZE_DQO) {
int last_frag_remain = last_frag_size %
GVE_TX_MAX_BUF_SIZE_DQO;
/* If the last frag was evenly divisible by
* GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
* split in the current segment.
Is this true even if the segment did not start at the start of the frag?
The comment probably is a bit confusing here. The current segment we are tracking could have a portion in the previous frag. The code assumed that the portion on the previous frag (if present) mapped to only one descriptor. However, that portion could have been split across two descriptors due to the restriction that each descriptor cannot exceed 16KB.
/* If the last frag was evenly divisible by
* GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
* split in the current segment.
This is true because the smallest multiple of 16KB is 32KB, and the largest gso_size at least for Ethernet will be 9K. But I don't think that that is what is used here as the basis for this statement?
That the largest Ethernet gso_size (9K) is less than GVE_TX_MAX_BUF_SIZE_DQO is an implicit assumption made in this patch and in that comment. Bailey, please correct me if I am wrong..
That's the case this fix is trying to address. I will work on simplifying the logic based on your suggestion below so that the fix is easier to follow
On Fri, Jul 19, 2024 at 7:31 AM Praveen Kaligineedi pkaligineedi@google.com wrote:
On Thu, Jul 18, 2024 at 8:47 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
On Thu, Jul 18, 2024 at 9:52 PM Praveen Kaligineedi pkaligineedi@google.com wrote:
On Thu, Jul 18, 2024 at 4:07 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
* segment, then it will count as two descriptors.
*/
if (last_frag_size > GVE_TX_MAX_BUF_SIZE_DQO) {
int last_frag_remain = last_frag_size %
GVE_TX_MAX_BUF_SIZE_DQO;
/* If the last frag was evenly divisible by
* GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
* split in the current segment.
Is this true even if the segment did not start at the start of the frag?
The comment probably is a bit confusing here. The current segment we are tracking could have a portion in the previous frag. The code assumed that the portion on the previous frag (if present) mapped to only one descriptor. However, that portion could have been split across two descriptors due to the restriction that each descriptor cannot exceed 16KB.
/* If the last frag was evenly divisible by
* GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
* split in the current segment.
This is true because the smallest multiple of 16KB is 32KB, and the largest gso_size at least for Ethernet will be 9K. But I don't think that that is what is used here as the basis for this statement?
That the largest Ethernet gso_size (9K) is less than GVE_TX_MAX_BUF_SIZE_DQO is an implicit assumption made in this patch and in that comment. Bailey, please correct me if I am wrong..
If last_frag_size is evenly divisible by GVE_TX_MAX_BUF_SIZE_DQO, it doesn't hit the edge case we're looking for.
- If it's evenly divisible, then we know it will use exactly (last_frag_size / GVE_TX_MAX_BUF_SIZE_DQO) descriptors
- GVE_TX_MAX_BUF_SIZE_DQO > 9k, so we know each descriptor won't create a segment which exceeds the limit
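As a quick numeric check (hypothetical sizes): a 32768-byte frag is posted as two 16384-byte buffers; any segment leftover in that frag is < gso_size <= ~9K, so it lies entirely within the second buffer and costs no extra descriptor.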
On Fri, Jul 19, 2024 at 9:56 AM Bailey Forrest bcf@google.com wrote:
On Fri, Jul 19, 2024 at 7:31 AM Praveen Kaligineedi pkaligineedi@google.com wrote:
On Thu, Jul 18, 2024 at 8:47 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
On Thu, Jul 18, 2024 at 9:52 PM Praveen Kaligineedi pkaligineedi@google.com wrote:
On Thu, Jul 18, 2024 at 4:07 PM Willem de Bruijn willemdebruijn.kernel@gmail.com wrote:
* segment, then it will count as two descriptors.
*/
if (last_frag_size > GVE_TX_MAX_BUF_SIZE_DQO) {
int last_frag_remain = last_frag_size %
GVE_TX_MAX_BUF_SIZE_DQO;
/* If the last frag was evenly divisible by
* GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
* split in the current segment.
Is this true even if the segment did not start at the start of the frag?
The comment probably is a bit confusing here. The current segment we are tracking could have a portion in the previous frag. The code assumed that the portion on the previous frag (if present) mapped to only one descriptor. However, that portion could have been split across two descriptors due to the restriction that each descriptor cannot exceed 16KB.
/* If the last frag was evenly divisible by
* GVE_TX_MAX_BUF_SIZE_DQO, then it will not be
* split in the current segment.
This is true because the smallest multiple of 16KB is 32KB, and the largest gso_size at least for Ethernet will be 9K. But I don't think that that is what is used here as the basis for this statement?
That the largest Ethernet gso_size (9K) is less than GVE_TX_MAX_BUF_SIZE_DQO is an implicit assumption made in this patch and in that comment. Bailey, please correct me if I am wrong..
If last_frag_size is evenly divisible by GVE_TX_MAX_BUF_SIZE_DQO, it doesn't hit the edge case we're looking for.
- If it's evenly divisible, then we know it will use exactly
(last_frag_size / GVE_TX_MAX_BUF_SIZE_DQO) descriptors
This assumes that gso_segment start is aligned with skb_frag start. That is not necessarily true, right?
If headlen minus protocol headers is 1B, then the first segment will have two descriptors { 1B, 9KB - 1 }. And the next segment can have skb_frag_size - ( 9KB - 1).
I think the statement is correct, but only because every multiple of 16KB is so much larger than the max gso_size of ~9KB that a single segment will never include more than two skb_frags.
Quite possibly the code overestimates the number of descriptors per segment now, but that is safe and only a performance regression.
- GVE_TX_MAX_BUF_SIZE_DQO > 9k, so we know each descriptor won't
create a segment which exceeds the limit
For a net patch, it is generally better to make a small fix rather than rewrite.
That said, my sketch without looping over every segment:
	while (off < skb->len) {
		gso_size_left = shinfo->gso_size;
		num_desc = 0;

		while (gso_size_left) {
			desc_len = min(gso_size_left, frag_size_left);
			gso_size_left -= desc_len;
			frag_size_left -= desc_len;
			num_desc++;

			if (num_desc > max_descs_per_seg)
				return false;

			if (!frag_size_left)
				frag_size_left = skb_frag_size(&shinfo->frags[frag_idx++]);
+			else
+				frag_size_left %= gso_size;	/* skip segments that fit in one desc */
		}
	}
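A quick trace of the added modulo (hypothetical sizes): with gso_size = 9000 and frag_size_left = 50000 when a segment completes mid-frag, frag_size_left %= gso_size leaves 5000; the five segments that fit entirely inside the frag each need one descriptor anyway, so they are skipped, and only the 5000-byte tail that may spill into the next frag is walked.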
On 7/21/24 21:10, Willem de Bruijn wrote:
On Fri, Jul 19, 2024 at 9:56 AM Bailey Forrest bcf@google.com wrote:
If last_frag_size is evenly divisible by GVE_TX_MAX_BUF_SIZE_DQO, it doesn't hit the edge case we're looking for.
- If it's evenly divisible, then we know it will use exactly
(last_frag_size / GVE_TX_MAX_BUF_SIZE_DQO) descriptors
This assumes that gso_segment start is aligned with skb_frag start. That is not necessarily true, right?
If headlen minus protocol headers is 1B, then the first segment will have two descriptors { 1B, 9KB - 1 }. And the next segment can have skb_frag_size - ( 9KB - 1).
I think the statement is correct, but only because every multiple of 16KB is so much larger than the max gso_size of ~9KB that a single segment will never include more than two skb_frags.
I'm a bit lost, but what about big TCP? That should allow (considerably) more than 2 frags of 16K each per segment.
In any case, it looks like this needs at least some clarification of the comments.
Thanks,
Paolo