It seems to fix the issue, based on tests with my synthetic reproducer, which consists of two carefully placed sleeps, one in start_xmit and another before cm_rep_handler, to force open the race window. To test it organically I need to commandeer many hundreds of nodes in our cluster, which can be fairly disruptive to our user community. I'll send over v3 as an RFC, and it would be great to get feedback on whether the patch is acceptable. If it is, I'll work on scheduling an at-scale test, after which I'll submit v3. I think the odds are very low that the synthetic reproducer would show the problem as fixed while the real-world test still hits it. I try to be thorough, though.
-Aaron
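
For readers without the driver at hand, here is a rough userspace analogue of the two-sleep reproducer described above. It is only a sketch under assumptions: it assumes the race is between the transmit path testing the "up" flag and then queueing, and the handler setting the flag and draining the queue. Every name in it (racy_dev, racy_xmit, racy_rep_handler, run_reproducer) is hypothetical and none of it is code from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the device state; not the driver's types. */
struct racy_dev {
    atomic_bool oper_up;  /* plays the role of the "connection up" flag */
    atomic_long queued;   /* packets parked until the connection is up */
    atomic_long sent;     /* packets that actually went out */
};

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = ms * 1000L * 1000L };
    nanosleep(&ts, NULL);
}

/* Transmit side: test the flag, then queue. No lock covers both steps. */
static void *racy_xmit(void *p)
{
    struct racy_dev *dev = p;

    if (atomic_load(&dev->oper_up)) {
        atomic_fetch_add(&dev->sent, 1);
    } else {
        sleep_ms(200);  /* sleep 1: hold the window open after the test */
        atomic_fetch_add(&dev->queued, 1);
    }
    return NULL;
}

/* Handler side: set the flag and drain whatever is queued right now. */
static void *racy_rep_handler(void *p)
{
    struct racy_dev *dev = p;

    sleep_ms(50);  /* sleep 2: let the transmit side test the flag first */
    atomic_store(&dev->oper_up, true);
    atomic_fetch_add(&dev->sent, atomic_exchange(&dev->queued, 0));
    return NULL;
}

/* Returns the number of packets left stranded on the queue. */
long run_reproducer(long *sent_out)
{
    struct racy_dev dev;
    pthread_t tx, rep;
    int rc;

    atomic_init(&dev.oper_up, false);
    atomic_init(&dev.queued, 0);
    atomic_init(&dev.sent, 0);

    rc = pthread_create(&tx, NULL, racy_xmit, &dev);
    assert(rc == 0);
    rc = pthread_create(&rep, NULL, racy_rep_handler, &dev);
    assert(rc == 0);
    (void)rc;

    pthread_join(tx, NULL);
    pthread_join(rep, NULL);

    *sent_out = atomic_load(&dev.sent);
    return atomic_load(&dev.queued);
}
```

With these delays the packet will normally end up stranded on the queue, but that depends on thread scheduling, so treat it as a demonstration of the window rather than a guaranteed outcome.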
On 8/20/18 1:40 PM, Aaron Knister wrote:
> On 8/20/18 12:28 PM, Jason Gunthorpe wrote:
>> On Mon, Aug 20, 2018 at 09:36:53AM +0300, Erez Shitrit wrote:
>>> Hi,
>>> Did you check the option to hold the netif_tx_lock_xxx() in ipoib_cm_rep_handler function (over the line set_bit(IPOIB_FLAG_OPER_UP)) instead of in the send data path flow?
>>
>> That does seem better, then the test_bit in the datapath could become non-atomic too :)
>>
>> Jason
>
> Thanks for the feedback! I've not tried that but I certainly can.
>
> -Aaron
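
The suggestion quoted above, taking the TX lock in the handler around the flag update rather than adding locking in the send path, can be illustrated with a small userspace sketch. It assumes the transmit path already runs with the TX lock held, which is why the flag can be a plain, non-atomic field here. A pthread mutex stands in for the TX lock, and all names (fake_dev, fake_start_xmit, fake_cm_rep_handler, run_demo) are hypothetical, not the driver's API or the eventual patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_SENDERS 16

/* Hypothetical stand-in for the device state; not the driver's types. */
struct fake_dev {
    pthread_mutex_t tx_lock;  /* plays the role of the netif TX lock */
    bool oper_up;             /* plain bool: only touched under tx_lock */
    long queued;              /* packets parked until the connection is up */
    long sent;                /* packets that actually went out */
};

struct xmit_arg {
    struct fake_dev *dev;
    long packets;
};

/* Datapath: the flag test and the queue-or-send step happen under tx_lock. */
static void fake_start_xmit(struct fake_dev *dev)
{
    pthread_mutex_lock(&dev->tx_lock);
    if (dev->oper_up)
        dev->sent++;
    else
        dev->queued++;
    pthread_mutex_unlock(&dev->tx_lock);
}

/* Control path: set the flag and drain the queue under the same lock. */
static void fake_cm_rep_handler(struct fake_dev *dev)
{
    pthread_mutex_lock(&dev->tx_lock);
    dev->oper_up = true;
    dev->sent += dev->queued;
    dev->queued = 0;
    pthread_mutex_unlock(&dev->tx_lock);
}

static void *xmit_thread(void *p)
{
    struct xmit_arg *arg = p;

    for (long i = 0; i < arg->packets; i++)
        fake_start_xmit(arg->dev);
    return NULL;
}

static void *rep_thread(void *p)
{
    fake_cm_rep_handler(p);
    return NULL;
}

/* Runs nthreads senders against one handler; returns stranded packets. */
long run_demo(int nthreads, long packets, long *sent_out)
{
    struct fake_dev dev = { .oper_up = false, .queued = 0, .sent = 0 };
    struct xmit_arg arg = { .dev = &dev, .packets = packets };
    pthread_t senders[MAX_SENDERS], handler;
    int rc;

    assert(nthreads > 0 && nthreads <= MAX_SENDERS);
    rc = pthread_mutex_init(&dev.tx_lock, NULL);
    assert(rc == 0);

    for (int i = 0; i < nthreads; i++) {
        rc = pthread_create(&senders[i], NULL, xmit_thread, &arg);
        assert(rc == 0);
    }
    rc = pthread_create(&handler, NULL, rep_thread, &dev);
    assert(rc == 0);
    (void)rc;

    for (int i = 0; i < nthreads; i++)
        pthread_join(senders[i], NULL);
    pthread_join(handler, NULL);
    pthread_mutex_destroy(&dev.tx_lock);

    *sent_out = dev.sent;
    return dev.queued;
}
```

Because the handler holds the lock across both the flag update and the drain, a sender either queues before the drain or sees the flag already set, so nothing should be left on the queue regardless of how the threads interleave.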