
NDR fabric does not work properly with opensm

Open rixon opened this issue 1 year ago • 0 comments

On systems with NDR fabrics, the IB fabric does not work properly when opensm from linux-rdma is the subnet manager. IB traffic such as MPI and IPoIB (with packets larger than ~400 bytes) fails. With opensmd from MLNX-OFED, or with inbox drivers and mlnxsm from mlnx_ib_mgmt, the same NDR fabric is fully functional for both MPI and IPoIB.

opensm.log shows many errors. This is opensm with the default configuration on RHEL 9.

To check whether the configuration requirements have changed, I tried the opensm.conf from mlnxsm. Unfortunately, that does not produce a properly working fabric either. It seems that opensm from linux-rdma is missing something needed for Quantum2 switches.

Example basic test failing with IPoIB:

[root@test1 ~]# ping -s 500 -c 1 test2-ib
PING test2-ib (10.3.1.231) 500(528) bytes of data.
From test2-ib (10.3.1.254) icmp_seq=1 Destination Host Unreachable

--- test2-ib ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
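
The ~400 byte threshold mentioned above could be narrowed down by sweeping the ping payload size. The following is only a minimal sketch: `probe` and `largest_ok` are hypothetical helper names, `HOST` is an assumed variable holding the IPoIB peer, and only standard iputils `ping` options (`-c`, `-W`, `-s`) are used.

```shell
#!/bin/sh
# Hypothetical helpers, not part of opensm or any RDMA package.

# probe SIZE: succeeds if one ICMP echo with SIZE payload bytes gets a reply
# within 2 seconds. HOST is an assumed variable, e.g. HOST=test2-ib.
probe() {
    ping -c 1 -W 2 -s "$1" "$HOST" >/dev/null 2>&1
}

# largest_ok SIZE...: prints the largest payload size for which probe
# succeeds, or an empty line if every size fails.
largest_ok() {
    best=""
    for size in "$@"; do
        if probe "$size"; then
            if [ -z "$best" ] || [ "$size" -gt "$best" ]; then
                best=$size
            fi
        fi
    done
    echo "$best"
}
```

For example, after setting `HOST=test2-ib`, running `largest_ok 100 200 300 400 500 1000` would report the largest of those payload sizes that still gets a reply.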

Example basic test failing with ibv_rc_pingpong:

[root@test3 ~]# ibv_rc_pingpong test1
  local address:  LID 0x0013, QPN 0x000063, PSN 0x566748, GID ::
  remote address: LID 0x0001, QPN 0x000048, PSN 0xdf4fa4, GID ::
Failed status transport retry counter exceeded (12) for wr_id 2
parse WC failed 1

Both of the above tests pass when using either of the commercial SMs (opensmd from MLNX-OFED or mlnxsm).

rixon, Dec 27 '24 23:12