On Wed, Mar 8, 2023 at 4:42 AM Xiubo Li <xiubli@redhat.com> wrote:
> How could this happen?
> Since the req hasn't been submitted yet, how could it receive a reply normally?
I have no idea. We have frequent problems with the MDS closing the connection (once or twice a week), and sometimes this triggers the WARNING, which leaves the server hanging. It looks like a timing issue, but the MDS connection problem itself is a separate matter. My patch only attempts to address the WARNING; not knowing much about Ceph internals, my idea was that even if the server sends bad reply packets, the client shouldn't panic.
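The "don't panic on a bad reply" idea can be illustrated with a simplified userspace sketch. This is not the actual patch and not kernel code: `struct request`, `enum req_state`, `lookup_request()` and `handle_reply()` are hypothetical names invented for illustration, and the real client tracks requests differently. The sketch only shows the general approach of logging and dropping a reply whose tid matches a request that has not been submitted yet, instead of treating it as a fatal inconsistency.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical request states; not the kernel's actual representation. */
enum req_state {
    REQ_WAITING,   /* created, but not yet sent to the server */
    REQ_SUBMITTED, /* sent; a reply is expected */
    REQ_DONE       /* reply already processed */
};

/* Hypothetical, minimal stand-in for a tracked request. */
struct request {
    uint64_t tid;          /* transaction id carried by the reply */
    enum req_state state;
};

/* Linear lookup by transaction id; returns NULL if the tid is unknown. */
static struct request *lookup_request(struct request *reqs, size_t n,
                                      uint64_t tid)
{
    for (size_t i = 0; i < n; i++)
        if (reqs[i].tid == tid)
            return &reqs[i];
    return NULL;
}

/*
 * Process a reply for the given tid.
 * Returns true if the reply was accepted, false if it was dropped.
 * A reply that does not match a submitted request is logged and
 * ignored; the request's state is left untouched.
 */
static bool handle_reply(struct request *reqs, size_t n, uint64_t tid)
{
    struct request *req = lookup_request(reqs, n, tid);

    if (!req) {
        fprintf(stderr, "reply for unknown tid %llu, dropping\n",
                (unsigned long long)tid);
        return false;
    }
    if (req->state != REQ_SUBMITTED) {
        fprintf(stderr, "reply on %s request tid %llu, dropping\n",
                req->state == REQ_WAITING ? "waiting" : "completed",
                (unsigned long long)tid);
        return false;
    }

    req->state = REQ_DONE;
    return true;
}
```

In this sketch, a reply for a request still in `REQ_WAITING` produces one log line and returns `false`, so the caller can discard the message and carry on.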
> It should be a corrupted reply, and it led us to get an incorrect req, which hasn't been submitted yet.
> BTW, do you have the dump of the corrupted msg by 'ceph_msg_dump(msg)'?
Unfortunately not - we have already scrubbed the server that had this problem and rebooted it with a fresh image including my patch. It seems I don't have a full copy of the kernel log anymore.
Coincidentally, the patch has prevented another kernel hang just a few minutes ago:
Mar 08 15:48:53 sweb1 kernel: ceph: mds0 caps stale
Mar 08 15:49:13 sweb1 kernel: ceph: mds0 caps stale
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 caps went stale, renewing
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 caps stale
Mar 08 15:49:35 sweb1 kernel: libceph: mds0 (1)10.41.2.11:6801 socket error on write
Mar 08 15:49:35 sweb1 kernel: libceph: mds0 (1)10.41.2.11:6801 session reset
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 closed our session
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 reconnect start
Mar 08 15:49:36 sweb1 kernel: ceph: mds0 reconnect success
Mar 08 15:49:36 sweb1 kernel: ceph: dropping dirty+flushing Fx state for 0000000064778286 2199046848012
Mar 08 15:49:40 sweb1 kernel: ceph: mdsc_handle_reply on waiting request tid 1106187
Mar 08 15:49:53 sweb1 kernel: ceph: mds0 caps renewed
Since my patch is already in place, the kernel hasn't checked the unexpected packet and thus hasn't dumped it.
If you need more information and have a patch with more logging, I could easily boot those servers with your patch and post that data next time it happens.
Max