Naïve questions on using BytePS for distributed training
Many thanks for sharing the implementation. It is really interesting and promising, and I want to give it a try in my own training as well. I have a few naïve questions about this.
I can now build the Docker image and reproduce the experiment on a single machine (with 4 or 8 GPUs) easily.
However, I ran into some issues with distributed training.
The way I launched distributed training before was through the MPI interface, which only needs a single command when training with PyTorch DistributedDataParallel.
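For reference, the single-command MPI-style launch I am used to looks roughly like this (the hostnames and script name are placeholders for my actual setup):

```shell
# One mpirun command starts all 16 processes across both machines.
# host1/host2 and train.py are placeholders, not real names.
mpirun -np 16 -H host1:8,host2:8 \
    python train.py
```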
However, it now seems that I need to run at least four different commands, according to the tutorial at https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md : one for the scheduler, one for the server, and one per worker (at least 2 workers for distributed training).
For example, with 16 GPUs (2 machines, 8 GPUs each), here is what I tried for distributed training:
for rank 0, run worker-1
for rank 8, run worker-2
for rank 1, run server
for rank 2, run scheduler
for other ranks, do nothing
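Concretely, following my reading of the tutorial, the commands behind the rank mapping above were roughly the following (the scheduler IP, port, and training script are placeholders, so I may well have misconfigured something):

```shell
# Common settings, exported on every machine (placeholder values):
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1   # IP where the scheduler runs
export DMLC_PS_ROOT_PORT=1234      # port of the scheduler

# Rank 2: scheduler (no training command needed)
DMLC_ROLE=scheduler bpslaunch

# Rank 1: server (no training command needed)
DMLC_ROLE=server bpslaunch

# Rank 0: worker-1 on the first machine, 8 GPUs
NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    DMLC_ROLE=worker DMLC_WORKER_ID=0 \
    bpslaunch python3 train.py

# Rank 8: worker-2 on the second machine, 8 GPUs
NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    DMLC_ROLE=worker DMLC_WORKER_ID=1 \
    bpslaunch python3 train.py
```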
But it throws some strange errors and exits.
May I ask: for a configuration like mine, what is the best way to launch training with BytePS?
Thanks again
There is a launcher you can try; see the README in the folder https://github.com/bytedance/byteps/tree/master/launcher
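To keep a single-command MPI workflow with the launcher, one untested option is a small wrapper script that picks the BytePS role from the MPI rank; everything here other than the `DMLC_*` variables and `bpslaunch` is my own invention (placeholder IP, port, script name), so treat it as a sketch:

```shell
#!/bin/sh
# Hypothetical wrapper, launched once per machine, e.g. via:
#   mpirun -np 3 -H sched-host,host1,host2 ./launch_byteps.sh
# Rank 0 runs the scheduler plus a server; other ranks run workers.
RANK=${OMPI_COMM_WORLD_RANK:-0}   # Open MPI exports this per process

export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1   # placeholder scheduler IP
export DMLC_PS_ROOT_PORT=1234      # placeholder scheduler port

if [ "$RANK" -eq 0 ]; then
    DMLC_ROLE=scheduler bpslaunch &   # scheduler in the background
    DMLC_ROLE=server bpslaunch        # server in the foreground
else
    export DMLC_ROLE=worker
    export DMLC_WORKER_ID=$((RANK - 1))
    bpslaunch python3 train.py        # placeholder training script
fi
```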