torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.6
Hi, I successfully ran the cifar10_deepspeed.py example on a single node (2x NVIDIA 3090). Now I want to run the same program on multiple nodes (2 nodes, each with two 3090s). I followed the example here to run my program.
My hostfile:
192.168.3.100 slots=2
192.168.3.101 slots=2
Then I ran the command:
deepspeed --num_gpus 2 --num_nodes 2 --master_addr 192.168.3.100 --hostfile hostfile cifar10_deepspeed.py --deepspeed_config ds_config.json
I also ran these commands on both 192.168.3.100 and 192.168.3.101:
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eno1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
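Side note: if I understand the DeepSpeed launcher correctly, these variables can also go in a ~/.deepspeed_env file (one VAR=VAL per line, no 'export'), which the launcher should forward to every node it starts over pdsh:
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=eno1
NCCL_P2P_DISABLE=1
NCCL_DEBUG=INFO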
And I hit this error:
[2024-01-08 16:13:55,768] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-08 16:13:56,462] [INFO] [multinode_runner.py:80:get_cmd] Running on the following workers: 192.168.3.100,192.168.3.101
[2024-01-08 16:13:56,463] [INFO] [runner.py:571:main] cmd = pdsh -S -f 1024 -w 192.168.3.100,192.168.3.101 source /root/SYH/GithubCode/DeepSpeedExamples/training/cifar/setup_env1.sh export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/root/SYH/GithubCode/DeepSpeedExamples/training/cifar; cd /root/SYH/GithubCode/DeepSpeedExamples/training/cifar; /root/anaconda3/envs/DS/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxOTIuMTY4LjMuMTAwIjogWzAsIDFdLCAiMTkyLjE2OC4zLjEwMSI6IFswLCAxXX0= --node_rank=%n --master_addr=192.168.3.100 --master_port=29500 cifar10_deepspeed.py --deepspeed_config ds_config.json
192.168.3.100: [2024-01-08 16:13:59,770] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.101: [2024-01-08 16:13:59,842] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:145:main] WORLD INFO DICT: {'192.168.3.100': [0, 1], '192.168.3.101': [0, 1]}
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=2, node_rank=0
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'192.168.3.100': [0, 1], '192.168.3.101': [2, 3]})
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:163:main] dist_world_size=4
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
192.168.3.101: [2024-01-08 16:14:00,102] [INFO] [launch.py:145:main] WORLD INFO DICT: {'192.168.3.100': [0, 1], '192.168.3.101': [0, 1]}
192.168.3.101: [2024-01-08 16:14:00,102] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=2, node_rank=1
192.168.3.101: [2024-01-08 16:14:00,103] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'192.168.3.100': [0, 1], '192.168.3.101': [2, 3]})
192.168.3.101: [2024-01-08 16:14:00,103] [INFO] [launch.py:163:main] dist_world_size=4
192.168.3.101: [2024-01-08 16:14:00,103] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
192.168.3.100: [2024-01-08 16:14:02,778] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.101: [2024-01-08 16:14:02,794] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.101: [2024-01-08 16:14:02,802] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.100: [2024-01-08 16:14:02,807] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.100: [2024-01-08 16:14:02,998] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.3.100: [2024-01-08 16:14:03,032] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.3.100: [2024-01-08 16:14:03,032] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
192.168.3.101: [2024-01-08 16:14:03,084] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.3.101: [2024-01-08 16:14:03,110] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.3.100: Files already downloaded and verified
192.168.3.101: Traceback (most recent call last):
192.168.3.101: File "/root/SYH/GithubCode/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 133, in <module>
192.168.3.101: torch.distributed.barrier()
192.168.3.101: File "/root/anaconda3/envs/DS/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
192.168.3.101: return func(*args, **kwargs)
192.168.3.101: File "/root/anaconda3/envs/DS/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
192.168.3.101: work = default_pg.barrier(opts=opts)
192.168.3.101: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.6
192.168.3.101: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
192.168.3.101: Last error:
192.168.3.101: socketProgressOpt: Call to recv from 192.168.3.101<47981> failed : Broken pipe
192.168.3.100: Traceback (most recent call last):
192.168.3.100: File "/root/SYH/GithubCode/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 133, in <module>
192.168.3.100: torch.distributed.barrier()
192.168.3.100: File "/root/anaconda3/envs/DS/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
192.168.3.100: return func(*args, **kwargs)
192.168.3.100: File "/root/anaconda3/envs/DS/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
192.168.3.100: work = default_pg.barrier(opts=opts)
192.168.3.100: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.6
192.168.3.100: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
192.168.3.100: Last error:
192.168.3.100: socketProgressOpt: Call to recv from 192.168.3.100<54189> failed : Broken pipe
192.168.3.100: [2024-01-08 16:14:07,091] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3422546
192.168.3.101: [2024-01-08 16:14:07,133] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1871110
192.168.3.100: [2024-01-08 16:14:07,256] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3422547
192.168.3.100: [2024-01-08 16:14:07,257] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/DS/bin/python', '-u', 'cifar10_deepspeed.py', '--local_rank=1', '--deepspeed_config', 'ds_config.json'] exits with return code = 1
192.168.3.101: [2024-01-08 16:14:07,299] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1871111
192.168.3.101: [2024-01-08 16:14:07,299] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/DS/bin/python', '-u', 'cifar10_deepspeed.py', '--local_rank=1', '--deepspeed_config', 'ds_config.json'] exits with return code = 1
pdsh@pai-worker1: 192.168.3.100: ssh exited with exit code 1
pdsh@pai-worker1: 192.168.3.101: ssh exited with exit code 1
It looks like something is wrong with NCCL? I'm not sure how to fix it. Do you have any suggestions? :)
@Rainbowman0, to help with further investigation, can you try running the communication micro-benchmarks here? https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/communication
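For example, something along these lines should exercise NCCL across both nodes (the exact script names and flags are in that folder's README):
deepspeed --hostfile hostfile all_reduce.py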
Thank you! I will give it a try! :)
OHHH!! I think it was a socket interface problem. I solved it with the steps below:
First, run ifconfig to list the network interfaces that are available:
(I can't use the 'eno1' interface, which is strange, so I ended up using 'br0'.)
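A quick way to see which interface actually carries the node's IP (instead of reading the whole ifconfig output) is something like:
ip -o -4 addr show | grep 192.168.3.
The interface name printed in that line is the one NCCL_SOCKET_IFNAME should point to.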
Open ~/.bashrc:
vim ~/.bashrc
Write these exports at the bottom:
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=br0 # for me it is the 'br0' interface, you should use yours :)
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
Reload ~/.bashrc:
source ~/.bashrc
Follow the same steps on both 192.168.3.100 and 192.168.3.101. :)
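In case it helps anyone else, here is a minimal sanity check I would run before the full training to confirm that cross-node NCCL traffic works (nccl_check.py is just a name I made up, it is not part of the example):

import argparse
import torch
import torch.distributed as dist
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the deepspeed launcher
args = parser.parse_args()

deepspeed.init_distributed()            # sets up the default NCCL process group
torch.cuda.set_device(args.local_rank)  # bind this process to its GPU

# every rank contributes its rank id; after all_reduce each rank should hold 0+1+2+3 = 6
t = torch.ones(1, device="cuda") * dist.get_rank()
dist.all_reduce(t)
dist.barrier()
print(f"rank {dist.get_rank()}/{dist.get_world_size()} all_reduce result: {t.item()}")

Launch it the same way as the training script:
deepspeed --hostfile hostfile nccl_check.py
If all four ranks print 6.0, the NCCL_SOCKET_IFNAME setting is correct.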