[REQUEST] Can DeepSpeed support automatic selection of different types of network cards, such as Ethernet and high-speed IB network cards?

Open pengshuang opened this issue 2 years ago • 0 comments

Our scenario involves two heterogeneous GPU clusters, Cluster A and Cluster B, each consisting of 20 GPU machines (A100-80G). Machines in Cluster A have both high-speed IB cards and regular Ethernet cards, while machines in Cluster B have high-speed RoCE cards and regular Ethernet cards. Because we cannot establish a high-speed IB network between Cluster A and Cluster B, traffic between the two clusters has to go over Ethernet using TCP/IP (sockets).

Our objective is for machines within each cluster to communicate over their high-speed cards, while traffic between Cluster A and Cluster B goes over regular Ethernet. I would like to modify DeepSpeed to support automatic card detection and configuration, so that distributed training can run across the two heterogeneous GPU clusters.
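
To make "automatic card detection" more concrete, here is a rough sketch of the detection half only. It assumes Linux machines where RDMA devices appear under `/sys/class/infiniband/<device>/ports/<port>/link_layer`, with the file reading `InfiniBand` for IB ports and `Ethernet` for RoCE ports. The function names `detect_rdma_link_layers` and `classify_fabric` are hypothetical and not part of DeepSpeed.

```python
from pathlib import Path


def detect_rdma_link_layers(sysfs_root="/sys/class/infiniband"):
    """Map each RDMA device name to the set of link layers reported by its ports.

    Assumption: the standard Linux sysfs layout
    <sysfs_root>/<device>/ports/<port>/link_layer is present.
    Returns an empty dict when the machine has no RDMA devices.
    """
    root = Path(sysfs_root)
    found = {}
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        layers = set()
        for link_file in dev.glob("ports/*/link_layer"):
            layers.add(link_file.read_text().strip())
        if layers:
            found[dev.name] = layers
    return found


def classify_fabric(link_layers):
    """Reduce the per-device map to one label: 'ib', 'roce', or 'ethernet'.

    'ethernet' means no RDMA-capable port was found, so only plain
    TCP/IP sockets would be available on this machine.
    """
    all_layers = set()
    for layers in link_layers.values():
        all_layers |= layers
    if "InfiniBand" in all_layers:
        return "ib"
    if "Ethernet" in all_layers:
        return "roce"
    return "ethernet"


if __name__ == "__main__":
    print(classify_fabric(detect_rdma_link_layers()))
```

On our machines I would expect this to report `ib` in Cluster A and `roce` in Cluster B. Detection is the easy part; what I am unsure about is how to choose the transport per pair of machines, so that peers in the same cluster use the high-speed cards while peers in different clusters fall back to sockets.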

I am unsure whether this is feasible and would greatly appreciate your response.

Thank you very much!

pengshuang, Jun 09 '23 14:06