Naïve questions on using BytePS for distributed training
Many thanks for sharing the implementation. It is really interesting and promising, and I want to give it a try in my own training as well. I have a few naïve questions about this.
I can now build the Docker image and reproduce the experiment on a single machine (with 4 or 8 GPUs) easily.
However, I ran into some issues with distributed training.
The way I launched distributed training before was through the MPI interface, which only needs a single command when training with PyTorch DistributedDataParallel.
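For reference, the single-command MPI-style launch I am used to looks roughly like this (the hostnames and script name are placeholders for my actual setup):

```shell
# One mpirun command starts all 16 processes across both machines.
# host1/host2 and train.py are placeholders, not real names.
mpirun -np 16 -H host1:8,host2:8 \
    python train.py
```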
However, it now seems that I need to run at least four different commands, according to the tutorial at https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md : one for the scheduler, one for the server, and one per worker (at least 2 workers for distributed training).
For example, with 16 GPUs (2 machines, 8 GPUs each), here is what I tried for distributed training:
for rank 0, run worker-1
for rank 8, run worker-2
for rank 1, run server
for rank 2, run scheduler
for other ranks, do nothing
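Concretely, following my reading of the tutorial, the commands behind the rank mapping above were roughly the following (the scheduler IP, port, and training script are placeholders, so I may well have misconfigured something):

```shell
# Common settings, exported on every machine (placeholder values):
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1   # IP where the scheduler runs
export DMLC_PS_ROOT_PORT=1234      # port of the scheduler

# Rank 2: scheduler (no training command needed)
DMLC_ROLE=scheduler bpslaunch

# Rank 1: server (no training command needed)
DMLC_ROLE=server bpslaunch

# Rank 0: worker-1 on the first machine, 8 GPUs
NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    DMLC_ROLE=worker DMLC_WORKER_ID=0 \
    bpslaunch python3 train.py

# Rank 8: worker-2 on the second machine, 8 GPUs
NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    DMLC_ROLE=worker DMLC_WORKER_ID=1 \
    bpslaunch python3 train.py
```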
But it throws some strange errors and exits.
May I ask: for a configuration like mine, what is the best way to launch training with BytePS?
Thanks again
There is a launcher you can try; see the README in the folder https://github.com/bytedance/byteps/tree/master/launcher
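To keep a single-command MPI workflow with the launcher, one untested option is a small wrapper script that picks the BytePS role from the MPI rank; everything here other than the `DMLC_*` variables and `bpslaunch` is my own invention (placeholder IP, port, script name), so treat it as a sketch:

```shell
#!/bin/sh
# Hypothetical wrapper, launched once per machine, e.g. via:
#   mpirun -np 3 -H sched-host,host1,host2 ./launch_byteps.sh
# Rank 0 runs the scheduler plus a server; other ranks run workers.
RANK=${OMPI_COMM_WORLD_RANK:-0}   # Open MPI exports this per process

export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1   # placeholder scheduler IP
export DMLC_PS_ROOT_PORT=1234      # placeholder scheduler port

if [ "$RANK" -eq 0 ]; then
    DMLC_ROLE=scheduler bpslaunch &   # scheduler in the background
    DMLC_ROLE=server bpslaunch        # server in the foreground
else
    export DMLC_ROLE=worker
    export DMLC_WORKER_ID=$((RANK - 1))
    bpslaunch python3 train.py        # placeholder training script
fi
```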