DeepSpeed
removed hardcoded "eth0" in OpenMPIRunner
When launching a DeepSpeed training run, I hit an error because my node's network interface is not named eth0, and I could not override the "btl_tcp_if_include" flag through --launcher_args.
I propose removing the hardcoded value and either inferring the correct interface dynamically or making it possible to override the contents of "mpirun_cmd" with the arguments passed in --launcher_args, which is currently not possible.
For example, I tried passing --launcher_args='--mca btl_tcp_if_include en0' and got an error because the argument ended up duplicated in the final command, which is built by simply concatenating the hardcoded arguments and the launcher arguments.
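One possible way to implement the override is sketched below. It is only an illustration, not the actual OpenMPIRunner code: the helper name `merge_mca_args` and the contents of the `defaults` dict are hypothetical. The idea is to drop any default `--mca` parameter that the user already set in --launcher_args, so that no key appears twice.

```python
import shlex


def merge_mca_args(defaults, launcher_args):
    """Return mpirun arguments in which user-supplied MCA keys replace defaults.

    defaults:      dict of MCA key -> value the runner would otherwise hardcode
                   (hypothetical; stands in for the hardcoded part of mpirun_cmd)
    launcher_args: raw string passed by the user via --launcher_args
    """
    tokens = shlex.split(launcher_args)

    # Collect every MCA key the user set explicitly ("--mca <key> <value>").
    user_keys = set()
    i = 0
    while i < len(tokens):
        if tokens[i] in ("--mca", "-mca") and i + 2 < len(tokens):
            user_keys.add(tokens[i + 1])
            i += 3
        else:
            i += 1

    # Emit only the defaults the user did not override, then the user's args.
    merged = []
    for key, value in defaults.items():
        if key not in user_keys:
            merged += ["--mca", key, value]
    return merged + tokens


if __name__ == "__main__":
    # Assumed defaults, for illustration only.
    defaults = {"btl": "^openib", "btl_tcp_if_include": "eth0"}
    print(merge_mca_args(defaults, "--mca btl_tcp_if_include en0"))
```

With this approach, passing --launcher_args='--mca btl_tcp_if_include en0' yields a single btl_tcp_if_include entry set to en0, and an empty --launcher_args keeps the defaults unchanged.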