
RDMA_CM_EVENT_ADDR_ERROR raised when running distributed training with PyTorch

anj-s opened this issue on Apr 30, 2021 · 0 comments

Describe the bug
I am unable to get distributed training running with the PyTorch backend. I consistently run into RDMA_CM_EVENT_ADDR_ERROR. Can someone take a look and let me know if I am missing something?

Run setup: 2 nodes
node 0: worker 0
node 1: worker 1, server, scheduler

scheduler_hostname = IP of the RDMA interface
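(For reference, a minimal Linux-only sketch of how that IP can be resolved programmatically; interface_ip is just an illustrative helper, not part of BytePS, and the interface name front0 matches DMLC_INTERFACE below.)

import fcntl
import socket
import struct

def interface_ip(ifname):
    # Ask the kernel for the IPv4 address bound to `ifname` (Linux only).
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    packed = fcntl.ioctl(
        s.fileno(),
        0x8915,  # SIOCGIFADDR
        struct.pack("256s", ifname[:15].encode()),
    )
    return socket.inet_ntoa(packed[20:24])

scheduler_hostname = interface_ip("front0")  # IP of the RDMA interface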

perf test works using ib_write_bw
single-node training works

To Reproduce
General env vars set on workers, scheduler and server:

os.environ["DMLC_ENABLE_RDMA"] = "ibverbs"
os.environ["DMLC_INTERFACE"] = "front0"
os.environ["ENABLE_RDMA_LOG"] = "1"
os.environ["PS_VERBOSE"] = "1"
os.environ["BYTEPS_LOG_LEVEL"] = "TRACE"
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_SHM_DISABLE"] = "1"
os.environ["BYTEPS_ENABLE_GDB"] = "0"
os.environ["BYTEPS_RDMA_RX_DEPTH"] = "128"
os.environ["BYTEPS_RDMA_START_DEPTH"] = "16"

server env vars (note: os.environ values must be strings):

os.environ["DMLC_ROLE"] = "server"
os.environ["DMLC_NUM_WORKER"] = "2"
os.environ["DMLC_NUM_SERVER"] = "1"
os.environ["DMLC_PS_ROOT_URI"] = scheduler_hostname
os.environ["DMLC_PS_ROOT_PORT"] = SCHEDULER_PORT

scheduler env vars:

os.environ["DMLC_ROLE"] = "scheduler"
os.environ["DMLC_NUM_WORKER"] = "2"
os.environ["DMLC_NUM_SERVER"] = "1"
os.environ["DMLC_PS_ROOT_URI"] = scheduler_hostname
os.environ["DMLC_PS_ROOT_PORT"] = SCHEDULER_PORT

worker env vars:

os.environ["DMLC_ROLE"] = "worker"
os.environ["DMLC_WORKER_ID"] = str(worker_id)
os.environ["DMLC_NUM_WORKER"] = "2"
os.environ["DMLC_NUM_SERVER"] = "1"
os.environ["DMLC_PS_ROOT_URI"] = scheduler_hostname
os.environ["DMLC_PS_ROOT_PORT"] = SCHEDULER_PORT
os.environ["BYTEPS_LOCAL_RANK"] = "0"
os.environ["BYTEPS_LOCAL_SIZE"] = "1"
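(For completeness, the role-specific settings above can be collected into one helper. This is an untested sketch, not my exact launcher: build_env is an illustrative name, and the placeholder values stand in for the cluster-specific scheduler_hostname and SCHEDULER_PORT.)

import os

scheduler_hostname = "10.0.0.1"  # placeholder: IP of the RDMA interface
SCHEDULER_PORT = "12345"         # placeholder: any free port on the scheduler node

COMMON = {
    "DMLC_NUM_WORKER": "2",
    "DMLC_NUM_SERVER": "1",
    "DMLC_PS_ROOT_URI": scheduler_hostname,
    "DMLC_PS_ROOT_PORT": SCHEDULER_PORT,  # must be a string, like all env values
}

def build_env(role, worker_id=0):
    # Start from the current environment (which already carries the general
    # RDMA/logging vars), then layer the shared and role-specific settings.
    env = dict(os.environ, DMLC_ROLE=role, **COMMON)
    if role == "worker":
        env["DMLC_WORKER_ID"] = str(worker_id)
        env["BYTEPS_LOCAL_RANK"] = "0"
        env["BYTEPS_LOCAL_SIZE"] = "1"
    return env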

Expected behavior
Able to run:

command = ("python /private/home/anj/.conda/envs/fairscale/bin/bpslaunch "
           "python /private/home/anj/byteps_repro/byteps/example/pytorch/train_mnist_byteps.py")
subprocess.check_call(command,
                      stdout=sys.stdout, stderr=sys.stderr, shell=True)

stack trace: https://gist.github.com/anj-s/6c808731287e9a504cb63c6f8013fad0


Environment:
OS: Ubuntu
GCC version: gcc 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
CUDA and NCCL version: CUDA 11.0, NCCL 2.7.8
Framework: PyTorch 1.8

