[REQUEST] Can DeepSpeed automatically select between different types of network cards, such as Ethernet and high-speed IB cards?
Our scenario involves two heterogeneous GPU clusters, Cluster A and Cluster B, each consisting of 20 GPU machines (A100-80G). Machines in Cluster A have both high-speed IB cards and regular Ethernet cards, while machines in Cluster B have high-speed RoCE cards and regular Ethernet cards. Because we cannot establish a high-speed IB network between Cluster A and Cluster B, traffic between the two clusters has to go over Ethernet using TCP/IP (sockets).
Our objective is for machines within each cluster to communicate over their high-speed cards (IB in Cluster A, RoCE in Cluster B), while traffic between Cluster A and Cluster B goes over regular Ethernet. I would like to modify DeepSpeed to support automatic card detection and configuration so that distributed training can run across the two heterogeneous GPU clusters.
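To make the "automatic card detection" part more concrete, here is a minimal sketch of what I have in mind for the per-node step. It assumes Linux hosts where RDMA devices (both IB and RoCE) are listed under `/sys/class/infiniband`, and it only proposes values for existing NCCL environment variables. The helper names and the `eth0` interface name are hypothetical, and nothing here is an existing DeepSpeed API.

```python
from pathlib import Path


def detect_rdma_link_layers(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to the link layer of its first port.

    On Linux, the per-port `link_layer` file reads "InfiniBand" for IB
    cards and "Ethernet" for RoCE cards.
    """
    root = Path(sysfs_root)
    layers = {}
    if not root.is_dir():
        return layers  # no RDMA-capable card visible on this host
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            link_file = port / "link_layer"
            if link_file.is_file():
                layers[dev.name] = link_file.read_text().strip()
                break  # the first readable port is enough for this sketch
    return layers


def propose_nccl_env(layers, eth_ifname="eth0"):
    """Hypothetical helper: propose NCCL settings for this node.

    `eth_ifname` is an assumed name for the regular Ethernet interface.
    """
    env = {"NCCL_SOCKET_IFNAME": eth_ifname}
    if layers:
        # RDMA devices found: leave the IB/RoCE transport enabled and
        # restrict it to the detected devices.
        env["NCCL_IB_DISABLE"] = "0"
        env["NCCL_IB_HCA"] = ",".join(sorted(layers))
    else:
        # No RDMA device: fall back to plain TCP/IP sockets.
        env["NCCL_IB_DISABLE"] = "1"
    return env


if __name__ == "__main__":
    detected = detect_rdma_link_layers()
    print(detected)
    print(propose_nccl_env(detected))
```

This only covers detection and configuration on a single node. Whether one job spanning both clusters can use the high-speed cards inside each cluster and sockets between them is exactly the part I am unsure about.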
I am unsure whether this is feasible and would greatly appreciate your feedback.
Thank you very much!