torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.6

Open Rainbowman0 opened this issue 2 years ago • 4 comments

Hi, I successfully ran the 'cifar10_deepspeed.py' example on a single node (2x NVIDIA 3090). Now I want to run the same program on multiple nodes (2 nodes, each with 2 3090s). I followed the example here to run my program.

My hostfile:

192.168.3.100 slots=2
192.168.3.101 slots=2

Then I ran the command:

deepspeed --num_gpus 2 --num_nodes 2 --master_addr 192.168.3.100 --hostfile hostfile cifar10_deepspeed.py --deepspeed_config ds_config.json

I also ran these commands on both 192.168.3.100 and 192.168.3.101:

export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eno1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO

And I got this error:

[2024-01-08 16:13:55,768] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-08 16:13:56,462] [INFO] [multinode_runner.py:80:get_cmd] Running on the following workers: 192.168.3.100,192.168.3.101
[2024-01-08 16:13:56,463] [INFO] [runner.py:571:main] cmd = pdsh -S -f 1024 -w 192.168.3.100,192.168.3.101 source /root/SYH/GithubCode/DeepSpeedExamples/training/cifar/setup_env1.sh export NCCL_SOCKET_IFNAME=eno1; export PYTHONPATH=/root/SYH/GithubCode/DeepSpeedExamples/training/cifar;  cd /root/SYH/GithubCode/DeepSpeedExamples/training/cifar; /root/anaconda3/envs/DS/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxOTIuMTY4LjMuMTAwIjogWzAsIDFdLCAiMTkyLjE2OC4zLjEwMSI6IFswLCAxXX0= --node_rank=%n --master_addr=192.168.3.100 --master_port=29500 cifar10_deepspeed.py --deepspeed_config ds_config.json
192.168.3.100: [2024-01-08 16:13:59,770] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.101: [2024-01-08 16:13:59,842] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:145:main] WORLD INFO DICT: {'192.168.3.100': [0, 1], '192.168.3.101': [0, 1]}
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=2, node_rank=0
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'192.168.3.100': [0, 1], '192.168.3.101': [2, 3]})
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:163:main] dist_world_size=4
192.168.3.100: [2024-01-08 16:14:00,061] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
192.168.3.101: [2024-01-08 16:14:00,102] [INFO] [launch.py:145:main] WORLD INFO DICT: {'192.168.3.100': [0, 1], '192.168.3.101': [0, 1]}
192.168.3.101: [2024-01-08 16:14:00,102] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=2, node_rank=1
192.168.3.101: [2024-01-08 16:14:00,103] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'192.168.3.100': [0, 1], '192.168.3.101': [2, 3]})
192.168.3.101: [2024-01-08 16:14:00,103] [INFO] [launch.py:163:main] dist_world_size=4
192.168.3.101: [2024-01-08 16:14:00,103] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
192.168.3.100: [2024-01-08 16:14:02,778] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.101: [2024-01-08 16:14:02,794] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.101: [2024-01-08 16:14:02,802] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.100: [2024-01-08 16:14:02,807] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
192.168.3.100: [2024-01-08 16:14:02,998] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.3.100: [2024-01-08 16:14:03,032] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.3.100: [2024-01-08 16:14:03,032] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
192.168.3.101: [2024-01-08 16:14:03,084] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.3.101: [2024-01-08 16:14:03,110] [INFO] [comm.py:637:init_distributed] cdb=None
192.168.3.100: Files already downloaded and verified
192.168.3.101: Traceback (most recent call last):
192.168.3.101:   File "/root/SYH/GithubCode/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 133, in <module>
192.168.3.101:     torch.distributed.barrier()
192.168.3.101:   File "/root/anaconda3/envs/DS/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
192.168.3.101:     return func(*args, **kwargs)
192.168.3.101:   File "/root/anaconda3/envs/DS/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
192.168.3.101:     work = default_pg.barrier(opts=opts)
192.168.3.101: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.6
192.168.3.101: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
192.168.3.101: Last error:
192.168.3.101: socketProgressOpt: Call to recv from 192.168.3.101<47981> failed : Broken pipe
192.168.3.100: Traceback (most recent call last):
192.168.3.100:   File "/root/SYH/GithubCode/DeepSpeedExamples/training/cifar/cifar10_deepspeed.py", line 133, in <module>
192.168.3.100:     torch.distributed.barrier()
192.168.3.100:   File "/root/anaconda3/envs/DS/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
192.168.3.100:     return func(*args, **kwargs)
192.168.3.100:   File "/root/anaconda3/envs/DS/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 3696, in barrier
192.168.3.100:     work = default_pg.barrier(opts=opts)
192.168.3.100: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, remote process exited or there was a network error, NCCL version 2.18.6
192.168.3.100: ncclRemoteError: A call failed possibly due to a network error or a remote process exiting prematurely.
192.168.3.100: Last error:
192.168.3.100: socketProgressOpt: Call to recv from 192.168.3.100<54189> failed : Broken pipe
192.168.3.100: [2024-01-08 16:14:07,091] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3422546
192.168.3.101: [2024-01-08 16:14:07,133] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1871110
192.168.3.100: [2024-01-08 16:14:07,256] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3422547
192.168.3.100: [2024-01-08 16:14:07,257] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/DS/bin/python', '-u', 'cifar10_deepspeed.py', '--local_rank=1', '--deepspeed_config', 'ds_config.json'] exits with return code = 1
192.168.3.101: [2024-01-08 16:14:07,299] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1871111
192.168.3.101: [2024-01-08 16:14:07,299] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/DS/bin/python', '-u', 'cifar10_deepspeed.py', '--local_rank=1', '--deepspeed_config', 'ds_config.json'] exits with return code = 1
pdsh@pai-worker1: 192.168.3.100: ssh exited with exit code 1
pdsh@pai-worker1: 192.168.3.101: ssh exited with exit code 1

It looks like something is wrong with NCCL? I'm not sure how to fix it. Do you have any suggestions? :)
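A quick sanity check on the launcher side: the --world_info value in the pdsh command in the log is base64-encoded JSON, so it can be decoded and compared with the WORLD INFO DICT, global_rank_mapping and dist_world_size lines. The sketch below does that. The function names are my own, and the rank assignment is a simplified reconstruction that reproduces the mapping printed in the log, not DeepSpeed's actual launcher code.

```python
import base64
import json

# Value of --world_info copied verbatim from the launcher command in the log.
WORLD_INFO = "eyIxOTIuMTY4LjMuMTAwIjogWzAsIDFdLCAiMTkyLjE2OC4zLjEwMSI6IFswLCAxXX0="


def decode_world_info(encoded: str) -> dict:
    """Decode the base64-encoded JSON string into {host: [local GPU ids]}."""
    return json.loads(base64.urlsafe_b64decode(encoded))


def global_ranks(world_info: dict) -> dict:
    """Assign consecutive global ranks host by host (simplified reconstruction)."""
    mapping, next_rank = {}, 0
    for host, local_gpus in world_info.items():
        mapping[host] = list(range(next_rank, next_rank + len(local_gpus)))
        next_rank += len(local_gpus)
    return mapping


world_info = decode_world_info(WORLD_INFO)
ranks = global_ranks(world_info)
world_size = sum(len(r) for r in ranks.values())

print("world info :", world_info)
print("global rank:", ranks)
print("world size :", world_size)
```

Here the decoded hosts, the rank mapping and the world size of 4 all agree with the log, so the launcher configuration looks consistent and the failure is more likely in how the nodes reach each other over the network.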

Rainbowman0 avatar Jan 08 '24 09:01 Rainbowman0

@Rainbowman0, to help with further investigation, can you try running the micro benchmarks here https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/communication

tjruwase avatar Jan 08 '24 14:01 tjruwase

Thank you! I will give it a try! :)

Rainbowman0 avatar Jan 09 '24 01:01 Rainbowman0

OHHH!! I think it was caused by using the wrong socket interface. I solved it with the steps below.

First, run ifconfig to show the network interfaces that can be used:

ifconfig

(I can't use the 'eno1' interface, which is strange, so I ended up using 'br0'.)

Open ~/.bashrc:

vim ~/.bashrc

Add these lines at the bottom:

export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=br0 # for me it is the 'br0' interface, you should use yours :)
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO

Reload ~/.bashrc:

source ~/.bashrc

Repeat the same steps on both 192.168.3.100 and 192.168.3.101. :)
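To find a usable value for NCCL_SOCKET_IFNAME without reading ifconfig output by eye, something like the sketch below may help. It lists each interface with its IPv4 address and keeps those whose address falls in the cluster subnet. It is Linux-only and uses only the standard library. The helper names are my own, and the 192.168.3.0/24 subnet is an assumption based on the hostfile addresses, since the thread never gives the netmask.

```python
import fcntl
import ipaddress
import socket
import struct

SIOCGIFADDR = 0x8915  # Linux ioctl request: get an interface's IPv4 address


def ipv4_of(ifname):
    """Return the IPv4 address bound to ifname, or None if it has none."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            req = struct.pack("256s", ifname[:15].encode())
            res = fcntl.ioctl(s.fileno(), SIOCGIFADDR, req)
        except OSError:  # unknown interface or no IPv4 address assigned
            return None
    return socket.inet_ntoa(res[20:24])


def interfaces_in_subnet(addrs, subnet):
    """Names of interfaces whose IPv4 address lies inside subnet."""
    net = ipaddress.ip_network(subnet)
    return [name for name, ip in addrs.items()
            if ip and ipaddress.ip_address(ip) in net]


if __name__ == "__main__":
    addrs = {name: ipv4_of(name) for _, name in socket.if_nameindex()}
    for name, ip in addrs.items():
        print(f"{name:12s} {ip}")
    # Assumed subnet; replace it with the network your hostfile addresses are on.
    print("candidates:", interfaces_in_subnet(addrs, "192.168.3.0/24"))
```

Run it on every node. An interface that shows up with no address, or with an address outside the subnet the nodes use to reach each other, is unlikely to work as NCCL_SOCKET_IFNAME. Whether that was the reason 'eno1' failed here is not confirmed in this thread.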

Rainbowman0 avatar Jan 09 '24 03:01 Rainbowman0