
[BUG]: Multiple node training error in VIT (2 nodes)

Open · fearless1007 opened this issue 2 years ago

🐛 Describe the bug

Command: colossalai run --nproc_per_node 1 --host gpu21,gpu11 --master_addr gpu21 train.py --config ./configs/vit_mutinode.py --dummy_data

Failures:
Traceback (most recent call last):
  File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/colossalaipy39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Root Cause (first observed failure):
[0]:
  time       : 2023-04-17_17:02:33
  host       : gpu21
  rank       : 1 (local_rank: 0)
  exitcode   : 1
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
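Since `error_file` is `<N/A>`, the actual exception raised inside train.py on gpu21 is never shown. Per the PyTorch elastic errors page linked above, the traceback can be surfaced by wrapping the entry point with the `record` decorator; a minimal sketch, assuming train.py exposes a `main()` function:

```python
# Sketch only (assumes train.py has a main() entry point): the `record` decorator
# writes the child process's exception and traceback to the elastic error file,
# so the real failure on gpu21 becomes visible in the launcher output.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training logic from train.py

if __name__ == "__main__":
    main()
```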

Error: failed to run torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=gpu11:29500 --rdzv_id=colossalai-default-job train.py --config ./configs/vit_mutinode.py --dummy_data on gpu21, is localhost: False, exception: Encountered a bad command exit code!

Results:
====== Training on All Nodes ======
gpu21: failure
gpu11: success

I don't know why only one node starts normally when training with multiple GPUs across multiple nodes. Does this require additional commands or modules?

Environment

No response

fearless1007 avatar Apr 17 '23 09:04 fearless1007

Similar unresolved issues: #2958.

Try torch.distributed.is_nccl_available()?
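A minimal sketch of that check, extended into a cross-node all_reduce smoke test (not from the original thread; the script name and launch method are assumptions): launching it on both nodes with the same torchrun arguments shown in the error above would confirm whether NCCL communication between gpu21 and gpu11 works independently of ColossalAI.

```python
# Hypothetical smoke test: verifies NCCL availability and basic cross-node
# communication. torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT/LOCAL_RANK
# in the environment, so init_process_group needs no explicit arguments.
import os

import torch
import torch.distributed as dist

def main():
    print("NCCL available:", torch.distributed.is_nccl_available())

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # A single all_reduce exercises the cross-node NCCL path.
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} all_reduce result: {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```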

JThh avatar Apr 18 '23 05:04 JThh

Yeah, the above call returns True, but the method in #2958 does not solve it. However, multi-node training with both DeepSpeed and plain PyTorch works in this environment.

fearless1007 avatar Apr 27 '23 07:04 fearless1007