Resending as the first attempt is not showing up in the list archive.
This patch converts several network drivers to use smp_rmb rather than read_barrier_depends. The initial issue was discovered with ixgbe on a Power machine which resulted in skb list corruption due to fetching a stale skb pointer. More details can be found in the ixgbe patch description.
Brian King (7): ixgbe: Fix skb list corruption on Power systems i40e: Use smp_rmb rather than read_barrier_depends ixgbevf: Use smp_rmb rather than read_barrier_depends igbvf: Use smp_rmb rather than read_barrier_depends igb: Use smp_rmb rather than read_barrier_depends fm10k: Use smp_rmb rather than read_barrier_depends i40evf: Use smp_rmb rather than read_barrier_depends
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +- drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- drivers/net/ethernet/intel/igbvf/netdev.c | 2 +- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 ++- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- 8 files changed, 9 insertions(+), 8 deletions(-)
This patch fixes an issue seen on Power systems with ixgbe which results in skb list corruption and an eventual kernel oops. The following is what was observed:
CPU 1 CPU2 ============================ ============================ 1: ixgbe_xmit_frame_ring ixgbe_clean_tx_irq 2: first->skb = skb eop_desc = tx_buffer->next_to_watch 3: ixgbe_tx_map read_barrier_depends() 4: wmb check adapter written status bit 5: first->next_to_watch = tx_desc napi_consume_skb(tx_buffer->skb ..); 6: writel(i, tx_ring->tail);
The read_barrier_depends is insufficient to ensure that tx_buffer->skb does not get loaded prior to tx_buffer->next_to_watch, which then results in loading a stale skb pointer, since we aren't zeroing it, like is done in other similar code in other networking drivers. This patch addresses both of these issues, replacing the read_barrier_depends with an smp_rmb, which will ensure the load of tx_buffer->skb will not occur until after the load from tx_buffer->next_to_watch. Secondly, it zeroes tx_buffer->skb to make this consistent with other Intel ethernet drivers and elso ensure we don't leave a stale skb pointer sitting around.
Cc: stablestable@vger.kernel.org Signed-off-by: Brian King brking@linux.vnet.ibm.com --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 6d5f31e..4d8c7bb 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -1192,7 +1192,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, break;
/* prevent any other reads prior to eop_desc */ - read_barrier_depends(); + smp_rmb();
/* if DD is not set pending work has not been completed */ if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD))) @@ -1218,6 +1218,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector, DMA_TO_DEVICE);
/* clear tx_buffer data */ + tx_buffer->skb = NULL; dma_unmap_len_set(tx_buffer, len, 0);
/* unmap remaining buffers */
The original issue being fixed in this patch was seen with the ixgbe driver, but the same issue exists with i40e as well, as the code is very similar. read_barrier_depends is not sufficient to ensure loads following it are not speculatively loaded out of order by the CPU, which can result in stale data being loaded, causing potential system crashes.
Cc: stablestable@vger.kernel.org Signed-off-by: Brian King brking@linux.vnet.ibm.com --- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +- drivers/net/ethernet/intel/i40e/i40e_txrx.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 6498da8..ea20aac 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -3760,7 +3760,7 @@ static bool i40e_clean_fdir_tx_irq(struct i40e_ring *tx_ring, int budget) break;
/* prevent any other reads prior to eop_desc */ - read_barrier_depends(); + smp_rmb();
/* if the descriptor isn't done, no work yet to do */ if (!(eop_desc->cmd_type_offset_bsz & diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index 120c68f..3c07ff1 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -759,7 +759,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi, break;
/* prevent any other reads prior to eop_desc */ - read_barrier_depends(); + smp_rmb();
i40e_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf); /* we have caught up to head, no work left to do */
The original issue being fixed in this patch was seen with the ixgbe driver, but the same issue exists with ixgbevf as well, as the code is very similar. read_barrier_depends is not sufficient to ensure loads following it are not speculatively loaded out of order by the CPU, which can result in stale data being loaded, causing potential system crashes.
Cc: stablestable@vger.kernel.org Signed-off-by: Brian King brking@linux.vnet.ibm.com --- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 032f8ac..90ecc4b 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -326,7 +326,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, break;
/* prevent any other reads prior to eop_desc */ - read_barrier_depends(); + smp_rmb();
/* if DD is not set pending work has not been completed */ if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
The original issue being fixed in this patch was seen with the ixgbe driver, but the same issue exists with igbvf as well, as the code is very similar. read_barrier_depends is not sufficient to ensure loads following it are not speculatively loaded out of order by the CPU, which can result in stale data being loaded, causing potential system crashes.
Cc: stablestable@vger.kernel.org Signed-off-by: Brian King brking@linux.vnet.ibm.com --- drivers/net/ethernet/intel/igbvf/netdev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c index 1ed5569..6f5888b 100644 --- a/drivers/net/ethernet/intel/igbvf/netdev.c +++ b/drivers/net/ethernet/intel/igbvf/netdev.c @@ -810,7 +810,7 @@ static bool igbvf_clean_tx_irq(struct igbvf_ring *tx_ring) break;
/* prevent any other reads prior to eop_desc */ - read_barrier_depends(); + smp_rmb();
/* if DD is not set pending work has not been completed */ if (!(eop_desc->wb.status & cpu_to_le32(E1000_TXD_STAT_DD)))
The original issue being fixed in this patch was seen with the ixgbe driver, but the same issue exists with igb as well, as the code is very similar. read_barrier_depends is not sufficient to ensure loads following it are not speculatively loaded out of order by the CPU, which can result in stale data being loaded, causing potential system crashes.
Cc: stablestable@vger.kernel.org Signed-off-by: Brian King brking@linux.vnet.ibm.com --- drivers/net/ethernet/intel/igb/igb_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index ea69af2..b0031c5 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -6970,7 +6970,7 @@ static bool igb_clean_tx_irq(struct igb_q_vector *q_vector, int napi_budget) break;
/* prevent any other reads prior to eop_desc */ - read_barrier_depends(); + smp_rmb();
/* if DD is not set pending work has not been completed */ if (!(eop_desc->wb.status & cpu_to_le32(E1000_TXD_STAT_DD)))
The original issue being fixed in this patch was seen with the ixgbe driver, but the same issue exists with fm10k as well, as the code is very similar. read_barrier_depends is not sufficient to ensure loads following it are not speculatively loaded out of order by the CPU, which can result in stale data being loaded, causing potential system crashes.
Cc: stablestable@vger.kernel.org Signed-off-by: Brian King brking@linux.vnet.ibm.com --- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index 9dffaba..103c0a7 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -1229,7 +1229,7 @@ static bool fm10k_clean_tx_irq(struct fm10k_q_vector *q_vector, break;
/* prevent any other reads prior to eop_desc */ - read_barrier_depends(); + smp_rmb();
/* if DD is not set pending work has not been completed */ if (!(eop_desc->flags & FM10K_TXD_FLAG_DONE))
The original issue being fixed in this patch was seen with the ixgbe driver, but the same issue exists with i40evf as well, as the code is very similar. read_barrier_depends is not sufficient to ensure loads following it are not speculatively loaded out of order by the CPU, which can result in stale data being loaded, causing potential system crashes.
Cc: stablestable@vger.kernel.org Signed-off-by: Brian King brking@linux.vnet.ibm.com --- drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c index c32c624..07a4e6e 100644 --- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c @@ -179,7 +179,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi, break;
/* prevent any other reads prior to eop_desc */ - read_barrier_depends(); + smp_rmb();
i40e_trace(clean_tx_irq, tx_ring, tx_desc, tx_buf); /* if the descriptor isn't done, no work yet to do */
On Thu, 16 Nov 2017 09:37:48 -0600 Brian King brking@linux.vnet.ibm.com wrote:
Resending as the first attempt is not showing up in the list archive.
This patch converts several network drivers to use smp_rmb rather than read_barrier_depends. The initial issue was discovered with ixgbe on a Power machine which resulted in skb list corruption due to fetching a stale skb pointer. More details can be found in the ixgbe patch description.
Thanks for the fix Brian, I bet it was a tough debug.
The only users in the entire kernel of read_barrier_depends() (not smp_read_barrier_depends) are the Intel network drivers.
Wouldn't it be better for power to just fix read_barrier_depends to do the right thing on power? The question I'm not sure of the answer to is: Is it really the wrong barrier to be using or is the implementation in the kernel powerpc wrong?
So I think the right thing might actually to be to: Fix arch powerpc read_barrier_depends to not be a noop, as the semantics of the read_barrier_depends seems to be sufficient to solve this problem, but it seems not to work for powerpc?
Alex Duyck may have the most clarity on this problem, and possible solutions so I made sure to CC him.
On 11/16/2017 01:33 PM, Jesse Brandeburg wrote:
On Thu, 16 Nov 2017 09:37:48 -0600 Brian King brking@linux.vnet.ibm.com wrote:
Resending as the first attempt is not showing up in the list archive.
This patch converts several network drivers to use smp_rmb rather than read_barrier_depends. The initial issue was discovered with ixgbe on a Power machine which resulted in skb list corruption due to fetching a stale skb pointer. More details can be found in the ixgbe patch description.
Thanks for the fix Brian, I bet it was a tough debug.
The only users in the entire kernel of read_barrier_depends() (not smp_read_barrier_depends) are the Intel network drivers.
Wouldn't it be better for power to just fix read_barrier_depends to do the right thing on power? The question I'm not sure of the answer to is: Is it really the wrong barrier to be using or is the implementation in the kernel powerpc wrong?
So I think the right thing might actually to be to: Fix arch powerpc read_barrier_depends to not be a noop, as the semantics of the read_barrier_depends seems to be sufficient to solve this problem, but it seems not to work for powerpc?
Jesse,
Thanks for the quick response.
Cc'ing linuxppc-dev as well.
I did think about changing the powerpc definition of read_barrier_depends, but after reading up on that barrier, decided it was not the correct barrier to be used in this context. Here is some good historical background on read_barrier_depends that I found, along with an example.
https://lwn.net/Articles/5159/
Since there is no data-dependency in the code in question here, I think the smp_rmb is the proper barrier to use.
For background, the code in question looks like this:
CPU 1 CPU2 ============================ ============================ 1: ixgbe_xmit_frame_ring ixgbe_clean_tx_irq 2: first->skb = skb eop_desc = tx_buffer->next_to_watch if (!eop_desc) break; 3: ixgbe_tx_map read_barrier_depends() if (!(eop_desc->wb.status) ... ) break; 4: wmb 5: first->next_to_watch = tx_desc napi_consume_skb(tx_buffer->skb ..); 6: writel(i, tx_ring->tail);
What we see on powerpc is that tx_buffer->skb on CPU2 is getting loaded prior to tx_buffer->next_to_watch. Changing the read_barrier_depends to a smp_rmb solves this and prevents us from dereferencing old pointer.
-Brian
On Thu, 2017-11-16 at 14:03 -0600, Brian King wrote:
On 11/16/2017 01:33 PM, Jesse Brandeburg wrote:
On Thu, 16 Nov 2017 09:37:48 -0600 Brian King brking@linux.vnet.ibm.com wrote:
Resending as the first attempt is not showing up in the list archive.
This patch converts several network drivers to use smp_rmb rather than read_barrier_depends. The initial issue was discovered with ixgbe on a Power machine which resulted in skb list corruption due to fetching a stale skb pointer. More details can be found in the ixgbe patch description.
Thanks for the fix Brian, I bet it was a tough debug.
The only users in the entire kernel of read_barrier_depends() (not smp_read_barrier_depends) are the Intel network drivers.
Wouldn't it be better for power to just fix read_barrier_depends to do the right thing on power? The question I'm not sure of the answer to is: Is it really the wrong barrier to be using or is the implementation in the kernel powerpc wrong?
So I think the right thing might actually to be to: Fix arch powerpc read_barrier_depends to not be a noop, as the semantics of the read_barrier_depends seems to be sufficient to solve this problem, but it seems not to work for powerpc?
Jesse,
Thanks for the quick response.
Cc'ing linuxppc-dev as well.
I did think about changing the powerpc definition of read_barrier_depends, but after reading up on that barrier, decided it was not the correct barrier to be used in this context. Here is some good historical background on read_barrier_depends that I found, along with an example.
https://lwn.net/Articles/5159/
Since there is no data-dependency in the code in question here, I think the smp_rmb is the proper barrier to use.
For background, the code in question looks like this:
CPU 1 CPU2 ============================ ============================ 1: ixgbe_xmit_frame_ring ixgbe_clean_tx_irq 2: first->skb = skb eop_desc = tx_buffer->next_to_watch if (!eop_desc) break; 3: ixgbe_tx_map read_barrier_depends() if (!(eop_desc->wb.status) ... ) break; 4: wmb 5: first->next_to_watch = tx_desc napi_consume_skb(tx_buffer->skb ..); 6: writel(i, tx_ring->tail);
What we see on powerpc is that tx_buffer->skb on CPU2 is getting loaded prior to tx_buffer->next_to_watch. Changing the read_barrier_depends to a smp_rmb solves this and prevents us from dereferencing old pointer.
-Brian
So the barrier part I am okay with for all the drivers. I hadn't accounted for the skb being read before next_to_watch. I was more concerned about the descriptor ring versus buffer_info structure at the time I made use of that.
The updates to clear tx_buffer->skb in ixgbe I am not okay with. Basically the tell-tale sign for skb present is next_to_watch being non-null. The extra writes add overhead and I want to avoid that at all costs since I want to avoid as much bouncing between the xmit path and the Tx clean-up as possible.
- Alex
On Thu, 16 Nov 2017 14:03:02 -0600 Brian King brking@linux.vnet.ibm.com wrote:
I did think about changing the powerpc definition of read_barrier_depends, but after reading up on that barrier, decided it was not the correct barrier to be used in this context. Here is some good historical background on read_barrier_depends that I found, along with an example.
https://lwn.net/Articles/5159/
Since there is no data-dependency in the code in question here, I think the smp_rmb is the proper barrier to use.
Hey Brian, thanks for the explanation, I'll agree with you and Alex that the smb_rmb replacement is okay. Does your test still pass without the ->skb NULLs?
Brian King brking@linux.vnet.ibm.com writes:
On 11/16/2017 01:33 PM, Jesse Brandeburg wrote:
On Thu, 16 Nov 2017 09:37:48 -0600 Brian King brking@linux.vnet.ibm.com wrote:
Resending as the first attempt is not showing up in the list archive.
This patch converts several network drivers to use smp_rmb rather than read_barrier_depends. The initial issue was discovered with ixgbe on a Power machine which resulted in skb list corruption due to fetching a stale skb pointer. More details can be found in the ixgbe patch description.
Thanks for the fix Brian, I bet it was a tough debug.
The only users in the entire kernel of read_barrier_depends() (not smp_read_barrier_depends) are the Intel network drivers.
Wouldn't it be better for power to just fix read_barrier_depends to do the right thing on power? The question I'm not sure of the answer to is: Is it really the wrong barrier to be using or is the implementation in the kernel powerpc wrong?
So I think the right thing might actually to be to: Fix arch powerpc read_barrier_depends to not be a noop, as the semantics of the read_barrier_depends seems to be sufficient to solve this problem, but it seems not to work for powerpc?
Jesse,
Thanks for the quick response.
Cc'ing linuxppc-dev as well.
I did think about changing the powerpc definition of read_barrier_depends, but after reading up on that barrier, decided it was not the correct barrier to be used in this context. Here is some good historical background on read_barrier_depends that I found, along with an example.
https://lwn.net/Articles/5159/
Since there is no data-dependency in the code in question here, I think the smp_rmb is the proper barrier to use.
Yes I agree.
The read_barrier_depends() is correct to order the load of eop_desc and then the dependent load of eop_desc->wb.status, but it's only required or does anything on Alpha.
For background, the code in question looks like this:
CPU 1 CPU2 ============================ ============================ 1: ixgbe_xmit_frame_ring ixgbe_clean_tx_irq 2: first->skb = skb eop_desc = tx_buffer->next_to_watch if (!eop_desc) break; 3: ixgbe_tx_map read_barrier_depends() if (!(eop_desc->wb.status) ... ) break; 4: wmb 5: first->next_to_watch = tx_desc napi_consume_skb(tx_buffer->skb ..); 6: writel(i, tx_ring->tail);
What we see on powerpc is that tx_buffer->skb on CPU2 is getting loaded prior to tx_buffer->next_to_watch. Changing the read_barrier_depends to a smp_rmb solves this and prevents us from dereferencing old pointer.
Right. Given that read_barrier_depends() is a nop, there's nothing there to order the load of tx_buffer->skb vs anything else.
If it's actually the load of tx_buffer->skb that's the issue then the smp_rmb() should really be immediately prior to that, rather than where the read_barrier_depends() currently is.
cheers
On 11/16/2017 04:57 PM, Michael Ellerman wrote:
Brian King brking@linux.vnet.ibm.com writes:
On 11/16/2017 01:33 PM, Jesse Brandeburg wrote:
On Thu, 16 Nov 2017 09:37:48 -0600 Brian King brking@linux.vnet.ibm.com wrote:
Resending as the first attempt is not showing up in the list archive.
This patch converts several network drivers to use smp_rmb rather than read_barrier_depends. The initial issue was discovered with ixgbe on a Power machine which resulted in skb list corruption due to fetching a stale skb pointer. More details can be found in the ixgbe patch description.
Thanks for the fix Brian, I bet it was a tough debug.
The only users in the entire kernel of read_barrier_depends() (not smp_read_barrier_depends) are the Intel network drivers.
Wouldn't it be better for power to just fix read_barrier_depends to do the right thing on power? The question I'm not sure of the answer to is: Is it really the wrong barrier to be using or is the implementation in the kernel powerpc wrong?
So I think the right thing might actually to be to: Fix arch powerpc read_barrier_depends to not be a noop, as the semantics of the read_barrier_depends seems to be sufficient to solve this problem, but it seems not to work for powerpc?
Jesse,
Thanks for the quick response.
Cc'ing linuxppc-dev as well.
I did think about changing the powerpc definition of read_barrier_depends, but after reading up on that barrier, decided it was not the correct barrier to be used in this context. Here is some good historical background on read_barrier_depends that I found, along with an example.
https://lwn.net/Articles/5159/
Since there is no data-dependency in the code in question here, I think the smp_rmb is the proper barrier to use.
Yes I agree.
The read_barrier_depends() is correct to order the load of eop_desc and then the dependent load of eop_desc->wb.status, but it's only required or does anything on Alpha.
For background, the code in question looks like this:
CPU 1 CPU2 ============================ ============================ 1: ixgbe_xmit_frame_ring ixgbe_clean_tx_irq 2: first->skb = skb eop_desc = tx_buffer->next_to_watch if (!eop_desc) break; 3: ixgbe_tx_map read_barrier_depends() if (!(eop_desc->wb.status) ... ) break; 4: wmb 5: first->next_to_watch = tx_desc napi_consume_skb(tx_buffer->skb ..); 6: writel(i, tx_ring->tail);
What we see on powerpc is that tx_buffer->skb on CPU2 is getting loaded prior to tx_buffer->next_to_watch. Changing the read_barrier_depends to a smp_rmb solves this and prevents us from dereferencing old pointer.
Right. Given that read_barrier_depends() is a nop, there's nothing there to order the load of tx_buffer->skb vs anything else.
If it's actually the load of tx_buffer->skb that's the issue then the smp_rmb() should really be immediately prior to that, rather than where the read_barrier_depends() currently is.
Alex,
How would you like to proceed? read_barrier_depends is a noop on all archs except alpha and blackfin. On those two archs, read_barrier_depends and smp_rmb end up resulting in the same code. So, I can either:
1. Remove the setting of tx_buffer->skb to NULL to address your concern and proceed with the rest of the patch set unchanged.
2. Leave the read_barrier_depends, as it is the right barrier to order the load of eop_desc with respect to eop_desc->wb.status, and then *add* an smp_rmb in the same code path to address the speculative load of the skb that I was running into. This is arguably more pure from the perspective of the use of the different barriers, but has the downside of additional overhead on alpha and blackfin.
Do you have a preference?
Thanks,
Brian
On Fri, 2017-11-17 at 10:16 -0600, Brian King wrote:
On 11/16/2017 04:57 PM, Michael Ellerman wrote:
Brian King brking@linux.vnet.ibm.com writes:
On 11/16/2017 01:33 PM, Jesse Brandeburg wrote:
On Thu, 16 Nov 2017 09:37:48 -0600 Brian King brking@linux.vnet.ibm.com wrote:
Resending as the first attempt is not showing up in the list archive.
This patch converts several network drivers to use smp_rmb rather than read_barrier_depends. The initial issue was discovered with ixgbe on a Power machine which resulted in skb list corruption due to fetching a stale skb pointer. More details can be found in the ixgbe patch description.
Thanks for the fix Brian, I bet it was a tough debug.
The only users in the entire kernel of read_barrier_depends() (not smp_read_barrier_depends) are the Intel network drivers.
Wouldn't it be better for power to just fix read_barrier_depends to do the right thing on power? The question I'm not sure of the answer to is: Is it really the wrong barrier to be using or is the implementation in the kernel powerpc wrong?
So I think the right thing might actually to be to: Fix arch powerpc read_barrier_depends to not be a noop, as the semantics of the read_barrier_depends seems to be sufficient to solve this problem, but it seems not to work for powerpc?
Jesse,
Thanks for the quick response.
Cc'ing linuxppc-dev as well.
I did think about changing the powerpc definition of read_barrier_depends, but after reading up on that barrier, decided it was not the correct barrier to be used in this context. Here is some good historical background on read_barrier_depends that I found, along with an example.
https://lwn.net/Articles/5159/
Since there is no data-dependency in the code in question here, I think the smp_rmb is the proper barrier to use.
Yes I agree.
The read_barrier_depends() is correct to order the load of eop_desc and then the dependent load of eop_desc->wb.status, but it's only required or does anything on Alpha.
For background, the code in question looks like this:
CPU 1 CPU2 ============================ ============================ 1: ixgbe_xmit_frame_ring ixgbe_clean_tx_irq 2: first->skb = skb eop_desc = tx_buffer->next_to_watch if (!eop_desc) break; 3: ixgbe_tx_map read_barrier_depends() if (!(eop_desc->wb.status) ... ) break; 4: wmb 5: first->next_to_watch = tx_desc napi_consume_skb(tx_buffer->skb ..); 6: writel(i, tx_ring->tail);
What we see on powerpc is that tx_buffer->skb on CPU2 is getting loaded prior to tx_buffer->next_to_watch. Changing the read_barrier_depends to a smp_rmb solves this and prevents us from dereferencing old pointer.
Right. Given that read_barrier_depends() is a nop, there's nothing there to order the load of tx_buffer->skb vs anything else.
If it's actually the load of tx_buffer->skb that's the issue then the smp_rmb() should really be immediately prior to that, rather than where the read_barrier_depends() currently is.
Alex,
How would you like to proceed? read_barrier_depends is a noop on all archs except alpha and blackfin. On those two archs, read_barrier_depends and smp_rmb end up resulting in the same code. So, I can either:
- Remove the setting of tx_buffer->skb to NULL to address your concern and proceed
with the rest of the patch set unchanged.
I am good with this option. We just need to be certain that it solves the original issue you saw.
- Leave the read_barrier_depends, as it is the right barrier to order the load
of eop_desc with respect to eop_desc->wb.status, and then *add* an smp_rmb in the same code path to address the speculative load of the skb that I was running into. This is arguably more pure from the perspective of the use of the different barriers, but has the downside of additional overhead on alpha and blackfin.
Do you have a preference?
If you have the smp_rmb there is no need for the read_barrier_depends as having both barriers would be redundant anyway. It was there as more of a mental place holder than anything else since I suspect these drivers would never be run on an alpha architecture anyway.
Thanks,
Brian
Thanks for finding this issue and taking the time to resolve it.
- Alex
linux-stable-mirror@lists.linaro.org