removed hardcoded "eth0" in OpenMPIRunner

Open santurini opened this issue 2 years ago • 0 comments

When launching a DeepSpeed training run, I hit an error because the network interface on my node is not named eth0. I could not override the "btl_tcp_if_include" flag through --launcher_args, so I had no way to work around it.

I propose removing the hardcoded value and either inferring the correct interface dynamically or making it possible to override "mpirun_cmd" with the arguments passed in --launcher_args, which is currently not possible.
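To illustrate the "dynamic" option, here is a minimal sketch of how an interface name could be picked instead of assuming eth0. The function name `pick_interface` and the excluded prefixes are my own assumptions, not anything that exists in DeepSpeed, and the heuristic (first interface that is not loopback or Docker) would probably need refining for multi-NIC nodes.

```python
import socket


def pick_interface(names, exclude_prefixes=("lo", "docker")):
    """Return the first interface name that does not start with an excluded prefix.

    Hypothetical helper: the prefix list is a rough heuristic, not a DeepSpeed API.
    Returns None when every interface is excluded.
    """
    for name in names:
        if not name.startswith(exclude_prefixes):
            return name
    return None


def local_interface_names():
    """List the interface names known to the OS via socket.if_nameindex()."""
    return [name for _index, name in socket.if_nameindex()]
```

The result of `pick_interface(local_interface_names())` could then be used as the value for btl_tcp_if_include, falling back to not setting the flag at all when it returns None.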

For example, I tried passing --launcher_args='--mca btl_tcp_if_include en0' and got an error because the argument ended up duplicated in the final command, which is built by simply concatenating strings.
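For the override option, something along these lines might avoid the duplication: parse the `--mca key value` triples out of the launcher arguments and let them replace the defaults rather than appending to them. This is only a sketch; `merge_mca_args` is a hypothetical helper, and the default values in the usage note are illustrative, not copied from the runner.

```python
import shlex


def merge_mca_args(defaults, launcher_args):
    """Build mpirun arguments where user-supplied --mca settings replace defaults.

    defaults: dict mapping MCA parameter name to value (illustrative, see usage note).
    launcher_args: raw string as it would be passed in --launcher_args.
    """
    tokens = shlex.split(launcher_args)
    mca = dict(defaults)  # copy so the caller's defaults are not mutated
    passthrough = []      # non-MCA tokens are kept in their original order

    i = 0
    while i < len(tokens):
        # An MCA setting is a triple: --mca <key> <value>
        if tokens[i] in ("--mca", "-mca") and i + 2 < len(tokens):
            mca[tokens[i + 1]] = tokens[i + 2]
            i += 3
        else:
            passthrough.append(tokens[i])
            i += 1

    merged = []
    for key, value in mca.items():
        merged += ["--mca", key, value]
    return merged + passthrough
```

With assumed defaults of `{"btl": "^openib", "btl_tcp_if_include": "eth0"}`, calling `merge_mca_args(defaults, "--mca btl_tcp_if_include en0")` yields a single btl_tcp_if_include entry set to en0 instead of two conflicting ones.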

santurini • May 11 '23 11:05