[BUG]: Multiple node training error in VIT (2 nodes)
🐛 Describe the bug
Command:
colossalai run --nproc_per_node 1 --host gpu21,gpu11 --master_addr gpu21 train.py --config ./configs/vit_mutinode.py --dummy_data
Failures:
Traceback (most recent call last):
  File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Root Cause (first observed failure):
[0]:
  time       : 2023-04-17_17:02:33
  host       : gpu21
  rank       : 1 (local_rank: 0)
  exitcode   : 1
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error: failed to run torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=gpu11:29500 --rdzv_id=colossalai-default-job train.py --config ./configs/vit_mutinode.py --dummy_data on gpu21, is localhost: False, exception: Encountered a bad command exit code!
Results:
====== Training on All Nodes ======
gpu21: failure
gpu11: success
I don't know why only one node starts normally when running with multiple GPUs across multiple nodes. Does it require additional commands or modules?
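As a sanity check that the rendezvous and NCCL communication between the two nodes work at all, a minimal script along these lines (the file name check_dist.py is just a placeholder, not part of the ViT example) could be launched with the same colossalai run / torchrun arguments; if it also fails on gpu21, the problem is in the cluster setup rather than in the example:

```python
# check_dist.py  (hypothetical file name, only a sketch for debugging)
import os

import torch
import torch.distributed as dist


def main():
    # torchrun / colossalai run export LOCAL_RANK, RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT for every spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Same backend as the real training run
    dist.init_process_group(backend="nccl")

    # Each rank contributes its own rank; after all_reduce every rank
    # should print the sum 0 + 1 + ... + (world_size - 1)
    t = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```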
Environment
No response
Similar unresolved issues: #2958.
Try torch.distributed.is_nccl_available()?
Yeah, the above command returns True, but the method in #2958 does not solve it. However, both DeepSpeed and plain torch can train across multiple nodes.
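Since plain torchrun works across the nodes, one way to narrow this down (only a sketch, assuming the 2023-era ColossalAI launch_from_torch signature that takes a config dict) would be to run just the ColossalAI initialization, without the ViT example, using the torchrun command printed in the error above, and see whether it already fails on gpu21:

```python
# probe_init.py  (hypothetical file name) - launch on each node with the
# same torchrun command shown in the error message, e.g.
#   torchrun --nproc_per_node=1 --nnodes=2 --node_rank=<0|1> \
#            --rdzv_backend=c10d --rdzv_endpoint=<master>:29500 probe_init.py
import torch
import torch.distributed as dist

import colossalai

# launch_from_torch reads the env vars set by torchrun; an empty config
# dict should be enough for this probe (assumed 2023-era signature)
colossalai.launch_from_torch(config={})

# A single all_reduce across both nodes; expected value is world_size
t = torch.ones(1, device=torch.cuda.current_device())
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce ok, value = {t.item()}")
```

If this probe succeeds on both nodes, the launcher and NCCL path are fine and the failure is specific to the ViT training script; if it fails the same way, the issue is in the launcher or environment rather than in the example.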