Segmentation fault using multiple nodes with multi-gpu
Hello! I'm trying to run benchmark_byteps.py from the step-by-step tutorial, but I get a segmentation fault during the NCCL ring setup stage. I followed the instructions for distributed training using TCP (the environment variables I used are listed at the very end of this post). Any suggestions on how to debug this, or what to look for in my setup? Thanks in advance.
I turned on BytePS logging and ran under gdb; the output is pasted below:
BytePS launching worker
!!!Enable profiling for WORKER_ID: 0 and local_rank: 0!!!
BYTEPS_TRACE_START_STEP: BYTEPS_TRACE_END_STEP: BYTEPS_TRACE_DIR: /mnt
Command: gdb -ex 'run' -ex 'bt' -batch --args python3 /mydata/lexu/byteps/example/scripts/benchmark_byteps.py --model resnet50 --num-iters 200 --batch-size 128
!!!Enable profiling for WORKER_ID: 0 and local_rank: 1!!!
BYTEPS_TRACE_START_STEP: BYTEPS_TRACE_END_STEP: BYTEPS_TRACE_DIR: /mnt
Command: gdb -ex 'run' -ex 'bt' -batch --args python3 /mydata/lexu/byteps/example/scripts/benchmark_byteps.py --model resnet50 --num-iters 200 --batch-size 128
!!!Enable profiling for WORKER_ID: 0 and local_rank: 2!!!
BYTEPS_TRACE_START_STEP: BYTEPS_TRACE_END_STEP: BYTEPS_TRACE_DIR: /mnt
Command: gdb -ex 'run' -ex 'bt' -batch --args python3 /mydata/lexu/byteps/example/scripts/benchmark_byteps.py --model resnet50 --num-iters 200 --batch-size 128
!!!Enable profiling for WORKER_ID: 0 and local_rank: 3!!!
BYTEPS_TRACE_START_STEP: BYTEPS_TRACE_END_STEP: BYTEPS_TRACE_DIR: /mnt
Command: gdb -ex 'run' -ex 'bt' -batch --args python3 /mydata/lexu/byteps/example/scripts/benchmark_byteps.py --model resnet50 --num-iters 200 --batch-size 128
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffa5bf1700 (LWP 7225)]
[New Thread 0x7fffa33f0700 (LWP 7226)]
[New Thread 0x7fffa2bef700 (LWP 7227)]
[New Thread 0x7fff9e3ee700 (LWP 7228)]
[New Thread 0x7fff9bbed700 (LWP 7229)]
[New Thread 0x7fff993ec700 (LWP 7230)]
[New Thread 0x7fff96beb700 (LWP 7231)]
[New Thread 0x7fff943ea700 (LWP 7232)]
[New Thread 0x7fff93be9700 (LWP 7233)]
[New Thread 0x7fffa5bf1700 (LWP 7235)]
[New Thread 0x7fff8f3e8700 (LWP 7234)]
[New Thread 0x7fffa33f0700 (LWP 7236)]
[New Thread 0x7fff8cbe7700 (LWP 7237)]
[New Thread 0x7fff8a3e6700 (LWP 7239)]
[New Thread 0x7fffa0bef700 (LWP 7238)]
[New Thread 0x7fff87be5700 (LWP 7240)]
[New Thread 0x7fff9e3ee700 (LWP 7241)]
[New Thread 0x7fff9bbed700 (LWP 7242)]
[New Thread 0x7fff9b3ec700 (LWP 7244)]
[New Thread 0x7fff853e4700 (LWP 7243)]
[New Thread 0x7fff98beb700 (LWP 7245)]
[New Thread 0x7fff84be3700 (LWP 7247)]
[New Thread 0x7fff963ea700 (LWP 7246)]
[New Thread 0x7fff91be9700 (LWP 7248)]
[New Thread 0x7fff8f3e8700 (LWP 7250)]
[New Thread 0x7fff8ebe7700 (LWP 7251)]
[New Thread 0x7fff8a3e6700 (LWP 7253)]
[New Thread 0x7fff87be5700 (LWP 7254)]
[New Thread 0x7fff853e4700 (LWP 7255)]
[New Thread 0x7fff82be3700 (LWP 7256)]
[New Thread 0x7fff803e2700 (LWP 7249)]
[New Thread 0x7fff803e2700 (LWP 7257)]
[New Thread 0x7fff7fbe1700 (LWP 7258)]
[New Thread 0x7fff7fbe1700 (LWP 7259)]
[New Thread 0x7fffa5bf1700 (LWP 7252)]
[New Thread 0x7fff7d3e0700 (LWP 7260)]
[New Thread 0x7fff7b3e0700 (LWP 7261)]
[New Thread 0x7fffa53f0700 (LWP 7262)]
[New Thread 0x7fff7abdf700 (LWP 7263)]
[New Thread 0x7fff78bdf700 (LWP 7264)]
[New Thread 0x7fffa0bef700 (LWP 7265)]
[New Thread 0x7fff763de700 (LWP 7266)]
[New Thread 0x7fff763de700 (LWP 7267)]
[New Thread 0x7fff9e3ee700 (LWP 7268)]
[New Thread 0x7fff73bdd700 (LWP 7269)]
[New Thread 0x7fff73bdd700 (LWP 7270)]
[New Thread 0x7fff713dc700 (LWP 7273)]
[New Thread 0x7fff9bbed700 (LWP 7271)]
[New Thread 0x7fff713dc700 (LWP 7272)]
[New Thread 0x7fff70bdb700 (LWP 7274)]
[New Thread 0x7fff9b3ec700 (LWP 7275)]
[New Thread 0x7fff6ebdb700 (LWP 7276)]
[New Thread 0x7fff6e3da700 (LWP 7277)]
[New Thread 0x7fff6bbd9700 (LWP 7280)]
[New Thread 0x7fff693d8700 (LWP 7281)]
[New Thread 0x7fff96beb700 (LWP 7278)]
[New Thread 0x7fff6c3da700 (LWP 7279)]
[New Thread 0x7fff66bd7700 (LWP 7282)]
[New Thread 0x7fff643d6700 (LWP 7285)]
[New Thread 0x7fff6bbd9700 (LWP 7284)]
[New Thread 0x7fff943ea700 (LWP 7283)]
[New Thread 0x7fff61bd5700 (LWP 7286)]
[New Thread 0x7fff693d8700 (LWP 7287)]
[New Thread 0x7fff91be9700 (LWP 7288)]
[New Thread 0x7fff5f3d4700 (LWP 7289)]
[New Thread 0x7fff66bd7700 (LWP 7290)]
[New Thread 0x7fff5cbd3700 (LWP 7292)]
[New Thread 0x7fff643d6700 (LWP 7293)]
[New Thread 0x7fff913e8700 (LWP 7291)]
[New Thread 0x7fff61bd5700 (LWP 7294)]
[New Thread 0x7fff5d3d4700 (LWP 7296)]
[New Thread 0x7fff8ebe7700 (LWP 7295)]
[New Thread 0x7fff5abd3700 (LWP 7297)]
[New Thread 0x7fff8c3e6700 (LWP 7298)]
[New Thread 0x7fff89be5700 (LWP 7299)]
[New Thread 0x7fff853e4700 (LWP 7300)]
[New Thread 0x7fff84be3700 (LWP 7301)]
[New Thread 0x7fff823e2700 (LWP 7302)]
[New Thread 0x7fff7fbe1700 (LWP 7303)]
[New Thread 0x7fff7d3e0700 (LWP 7304)]
[New Thread 0x7fff7abdf700 (LWP 7305)]
[New Thread 0x7fff783de700 (LWP 7306)]
[New Thread 0x7fff73bdd700 (LWP 7307)]
[New Thread 0x7fff713dc700 (LWP 7308)]
[New Thread 0x7fff6ebdb700 (LWP 7309)]
[New Thread 0x7fff6c3da700 (LWP 7310)]
[New Thread 0x7fff69bd9700 (LWP 7311)]
[New Thread 0x7fff673d8700 (LWP 7312)]
[New Thread 0x7fff66bd7700 (LWP 7313)]
[New Thread 0x7fff643d6700 (LWP 7314)]
[New Thread 0x7fff61bd5700 (LWP 7315)]
[New Thread 0x7fff5f3d4700 (LWP 7316)]
[New Thread 0x7fff5abd3700 (LWP 7317)]
[New Thread 0x7fffa5bf1700 (LWP 7318)]
[New Thread 0x7fffa53f0700 (LWP 7319)]
[New Thread 0x7fffa2bef700 (LWP 7320)]
[New Thread 0x7fffa03ee700 (LWP 7321)]
[New Thread 0x7fff9bbed700 (LWP 7322)]
[New Thread 0x7fff9b3ec700 (LWP 7323)]
[New Thread 0x7fff96beb700 (LWP 7324)]
[New Thread 0x7fff963ea700 (LWP 7325)]
[New Thread 0x7fff93be9700 (LWP 7326)]
[New Thread 0x7fff8f3e8700 (LWP 7327)]
[New Thread 0x7fff8cbe7700 (LWP 7328)]
[New Thread 0x7fff8a3e6700 (LWP 7329)]
[New Thread 0x7fff87be5700 (LWP 7330)]
[New Thread 0x7fff853e4700 (LWP 7331)]
[New Thread 0x7fff84be3700 (LWP 7332)]
[New Thread 0x7fff803e2700 (LWP 7333)]
[New Thread 0x7fff7dbe1700 (LWP 7334)]
[New Thread 0x7fff7b3e0700 (LWP 7335)]
[New Thread 0x7fff78bdf700 (LWP 7336)]
[New Thread 0x7fff763de700 (LWP 7337)]
[New Thread 0x7fff73bdd700 (LWP 7338)]
[New Thread 0x7fff713dc700 (LWP 7339)]
[New Thread 0x7fff6ebdb700 (LWP 7340)]
[New Thread 0x7fff6e3da700 (LWP 7341)]
[New Thread 0x7fff6bbd9700 (LWP 7342)]
[New Thread 0x7fff673d8700 (LWP 7343)]
[New Thread 0x7fff64bd7700 (LWP 7344)]
[New Thread 0x7fff643d6700 (LWP 7345)]
[New Thread 0x7fff61bd5700 (LWP 7346)]
[New Thread 0x7fff5f3d4700 (LWP 7347)]
[New Thread 0x7fff5abd3700 (LWP 7348)]
[2020-11-12 00:46:40.186084: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2020-11-12 00:46:40.186090: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2020-11-12 00:46:40.186134: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2020-11-12 00:46:40.186147: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2020-11-12 00:46:40.186150: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2020-11-12 00:46:40.186156: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2020-11-12 00:46:40.186164: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2020-11-12 00:46:40.186168: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[2020-11-12 00:46:40.186175: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2020-11-12 00:46:40.186178: I byteps/common/compressor/compressor_registry.cc:28] nesterov_momentum compressor is registered
[2020-11-12 00:46:40.186188: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[2020-11-12 00:46:40.186198: I byteps/common/compressor/compressor_registry.cc:28] nesterov_momentum compressor is registered
[2020-11-12 00:46:40.266210: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2020-11-12 00:46:40.266277: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2020-11-12 00:46:40.266326: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2020-11-12 00:46:40.266341: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2020-11-12 00:46:40.266357: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[2020-11-12 00:46:40.266371: I byteps/common/compressor/compressor_registry.cc:28] nesterov_momentum compressor is registered
[2020-11-12 00:46:40.316032: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2020-11-12 00:46:40.316083: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2020-11-12 00:46:40.316094: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2020-11-12 00:46:40.316105: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2020-11-12 00:46:40.316115: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[2020-11-12 00:46:40.316123: I byteps/common/compressor/compressor_registry.cc:28] nesterov_momentum compressor is registered
[New Thread 0x7fff46610700 (LWP 7361)]
[New Thread 0x7fff46610700 (LWP 7362)]
[New Thread 0x7fff46610700 (LWP 7363)]
[New Thread 0x7fff45dcf700 (LWP 7364)]
[New Thread 0x7fff455ce700 (LWP 7365)]
[New Thread 0x7fff45dcf700 (LWP 7366)]
[New Thread 0x7fff44dcd700 (LWP 7367)]
[New Thread 0x7fff45dcf700 (LWP 7368)]
[New Thread 0x7fff46610700 (LWP 7369)]
[New Thread 0x7fff45dcf700 (LWP 7370)]
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fff4ddac62e in ncclGetUniqueId () from /usr/local/lib/python3.6/dist-packages/byteps/torch/c_lib.cpython-36m-x86_64-linux-gnu.so
#0 0x00007fff4ddac62e in ncclGetUniqueId () from /usr/local/lib/python3.6/dist-packages/byteps/torch/c_lib.cpython-36m-x86_64-linux-gnu.so
#1 0x00007fff4dd04108 in byteps::common::NcclManager::ConstructRings (this=this@entry=0x276c120) at byteps/common/nccl_manager.cc:102
#2 0x00007fff4dd04beb in byteps::common::NcclManager::NcclManager (this=0x276c120, comm=...) at byteps/common/nccl_manager.cc:50
#3 0x00007fff4dcecf62 in __gnu_cxx::new_allocator<byteps::common::NcclManager>::construct<byteps::common::NcclManager, std::shared_ptr<byteps::common::BytePSComm>&> (this=<optimized out>, __p=0x276c120) at /usr/include/c++/7/ext/new_allocator.h:136
#4 std::allocator_traits<std::allocator<byteps::common::NcclManager> >::construct<byteps::common::NcclManager, std::shared_ptr<byteps::common::BytePSComm>&> (__a=..., __p=<optimized out>) at /usr/include/c++/7/bits/alloc_traits.h:475
#5 std::_Sp_counted_ptr_inplace<byteps::common::NcclManager, std::allocator<byteps::common::NcclManager>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<byteps::common::BytePSComm>&> (__a=..., this=0x276c110) at /usr/include/c++/7/bits/shared_ptr_base.h:526
#6 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<byteps::common::NcclManager, std::allocator<byteps::common::NcclManager>, std::shared_ptr<byteps::common::BytePSComm>&> (__a=..., this=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:637
#7 std::__shared_ptr<byteps::common::NcclManager, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<byteps::common::NcclManager>, std::shared_ptr<byteps::common::BytePSComm>&> (__a=..., __tag=..., this=<optimized out>) at /usr/include/c++/7/bits/shared_ptr_base.h:1295
#8 std::shared_ptr<byteps::common::NcclManager>::shared_ptr<std::allocator<byteps::common::NcclManager>, std::shared_ptr<byteps::common::BytePSComm>&> (__a=..., __tag=..., this=<optimized out>) at /usr/include/c++/7/bits/shared_ptr.h:344
#9 std::allocate_shared<byteps::common::NcclManager, std::allocator<byteps::common::NcclManager>, std::shared_ptr<byteps::common::BytePSComm>&> (__a=...) at /usr/include/c++/7/bits/shared_ptr.h:691
#10 std::make_shared<byteps::common::NcclManager, std::shared_ptr<byteps::common::BytePSComm>&> () at /usr/include/c++/7/bits/shared_ptr.h:707
#11 byteps::common::BytePSGlobal::Init () at byteps/common/global.cc:191
#12 0x00007fff4dcd04d1 in byteps::common::byteps_lazy_init () at byteps/common/operations.cc:42
#13 0x00007ffff6308dae in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#14 0x00007ffff630871f in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
#15 0x00007ffff651c7e3 in _ctypes_callproc () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
#16 0x00007ffff651cc33 in ?? () from /usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
#17 0x00000000005aa6ec in _PyObject_FastCallKeywords ()
#18 0x000000000050abb3 in ?? ()
#19 0x000000000050c5b9 in _PyEval_EvalFrameDefault ()
#20 0x0000000000508245 in ?? ()
#21 0x000000000050a080 in ?? ()
#22 0x000000000050aa7d in ?? ()
#23 0x000000000050c5b9 in _PyEval_EvalFrameDefault ()
#24 0x0000000000508245 in ?? ()
#25 0x000000000050b403 in PyEval_EvalCode ()
#26 0x0000000000635222 in ?? ()
#27 0x00000000006352d7 in PyRun_FileExFlags ()
#28 0x0000000000638a8f in PyRun_SimpleFileExFlags ()
#29 0x0000000000639631 in Py_Main ()
#30 0x00000000004b0f40 in main ()
Here is the list of environment variables I used:
+ export DMLC_WORKER_ID=0
+ DMLC_WORKER_ID=0
+ export DMLC_NUM_WORKER=2
+ DMLC_NUM_WORKER=2
+ export DMLC_ROLE=worker
+ DMLC_ROLE=worker
+ export DMLC_NUM_SERVER=2
+ DMLC_NUM_SERVER=2
+ export DMLC_PS_ROOT_URI=10.10.1.1
+ DMLC_PS_ROOT_URI=10.10.1.1
+ export DMLC_PS_ROOT_PORT=12345
+ DMLC_PS_ROOT_PORT=12345
+ SCRIPT=/mydata/lexu/byteps/example/scripts/benchmark_byteps.py
+ MODEL=resnet50
+ NUM_ITERS=200
+ BATCH_SIZE=128
+ EPOCH=100
+ OUTPUT_FILE=/mnt/bps-worker-resnet50-BS128-node0-link-1
+ DATASET=/mnt/tiny-imagenet-200
+ export BYTEPS_LOCAL_RANK=0
+ BYTEPS_LOCAL_RANK=0
+ export BYTEPS_LOCAL_SIZE=2
+ BYTEPS_LOCAL_SIZE=2
+ export BYTEPS_ENABLE_GDB=1
+ BYTEPS_ENABLE_GDB=1
+ export BYTEPS_LOG_LEVEL=INFO
+ BYTEPS_LOG_LEVEL=INFO
+ export BYTEPS_FORCE_DISTRIBUTED=1
+ BYTEPS_FORCE_DISTRIBUTED=1
+ export PS_VERBOSE=2
+ PS_VERBOSE=2
+ export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda/bin
+ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda/bin
+ export NVIDIA_VISIBLE_DEVICES=0,1,2,3
+ NVIDIA_VISIBLE_DEVICES=0,1,2,3
+ export CUDA_VISIBLE_DEVICES=0,1,2,3
+ CUDA_VISIBLE_DEVICES=0,1,2,3
+ export IFNAME=enp94s0f0
+ IFNAME=enp94s0f0
+ echo enp94s0f0
enp94s0f0
+ export export NCCL_SOCKET_IFNAME=enp94s0f0
+ NCCL_SOCKET_IFNAME=enp94s0f0
+ export NCCL_DEBUG=DEBUG
+ NCCL_DEBUG=DEBUG
+ export BYTEPS_TRACE_ON=1
+ BYTEPS_TRACE_ON=1
+ export BYTEPS_TRACE_DIR=/mnt
+ BYTEPS_TRACE_DIR=/mnt
+ OUTPUT_FILE=/mnt/bps-worker-resnet50-BS128-node0-link-1-benchmark
+ bpslaunch python3 /mydata/lexu/byteps/example/scripts/benchmark_byteps.py --model resnet50 --num-iters 200 --batch-size 128
+ id=1
+ sleep 2
+ read -u 3 -r SLAVE
+ OUTPUT_FILE=/mnt/bps-worker-resnet50-BS128-node1-link-1
+ WORKER_ID=1
+ command='/mydata/lexu/byteps/example/scripts/bps_worker.sh 1 2 2 10.10.1.1 12345 /mydata/lexu/byteps/example/scripts/benchmark_byteps.py resnet50 200 128 100 /mnt/bps-worker-resnet50-BS128-node1-link-1 /mnt/tiny-imagenet-200'
+ ssh -o StrictHostKeyChecking=no node1-link-1 /mydata/lexu/byteps/example/scripts/bps_worker.sh 1 2 2 10.10.1.1 12345 /mydata/lexu/byteps/example/scripts/benchmark_byteps.py resnet50 200 128 100 /mnt/bps-worker-resnet50-BS128-node1-link-1 /mnt/tiny-imagenet-200
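Two things in the trace above can be checked mechanically before digging into NCCL itself. BYTEPS_LOCAL_SIZE is exported as 2, while four devices are visible and the log shows local_rank 0 through 3 being launched. NCCL_DEBUG is also set to DEBUG, which is not one of the levels assumed here to be accepted by NCCL (VERSION, WARN, INFO, TRACE, per NCCL's documentation). The sketch below is a hypothetical helper, not part of BytePS, that flags both. Whether either one is related to the crash is not established in this thread, and bpslaunch may override BYTEPS_LOCAL_SIZE on its own.

```python
# Hypothetical sanity check for a worker's environment; not part of BytePS.
# The accepted NCCL_DEBUG levels below are an assumption taken from NCCL's docs.
ASSUMED_NCCL_DEBUG_LEVELS = {"VERSION", "WARN", "INFO", "TRACE"}


def check_worker_env(env):
    """Return a list of human-readable warnings for one worker's environment."""
    warnings = []

    # Count the GPUs the process is allowed to see.
    devices = [d for d in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if d]

    # BYTEPS_LOCAL_SIZE should plausibly match the number of visible GPUs.
    local_size = env.get("BYTEPS_LOCAL_SIZE")
    if local_size is not None and int(local_size) != len(devices):
        warnings.append(
            "BYTEPS_LOCAL_SIZE=%s but %d devices are visible"
            % (local_size, len(devices))
        )

    # An unrecognized level may mean no NCCL debug output is produced at all.
    nccl_debug = env.get("NCCL_DEBUG")
    if nccl_debug is not None and nccl_debug.upper() not in ASSUMED_NCCL_DEBUG_LEVELS:
        warnings.append("NCCL_DEBUG=%s is not a recognized level" % nccl_debug)

    return warnings


# Values copied from the shell trace in this post.
trace_env = {
    "BYTEPS_LOCAL_SIZE": "2",
    "CUDA_VISIBLE_DEVICES": "0,1,2,3",
    "NVIDIA_VISIBLE_DEVICES": "0,1,2,3",
    "NCCL_DEBUG": "DEBUG",
}

for message in check_worker_env(trace_env):
    print(message)
```

If the NCCL_DEBUG warning applies, rerunning with NCCL_DEBUG=INFO would be a cheap way to get NCCL's own initialization messages next to the gdb backtrace.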
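Your title is "Segmentation fault using multiple nodes with multi-gpu". Are you sure the problem is specific to multiple nodes? From the log, the crash is inside ncclGetUniqueId, called from NcclManager::ConstructRings, so it seems to involve only NCCL and should also happen on a single node.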
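To test the single-node hypothesis, one option is to rerun the same benchmark on one machine with the multi-node settings removed. The sketch below builds such an environment from the variables shown in the trace. The helper names are hypothetical, and it assumes that with DMLC_NUM_WORKER=1 and BYTEPS_FORCE_DISTRIBUTED unset BytePS takes its local, NCCL-only path, so no scheduler or server processes would be needed.

```python
import os


def single_node_env(base_env, visible_devices="0,1,2,3"):
    """Return a copy of base_env adjusted for a one-machine BytePS run."""
    env = dict(base_env)
    env.update(
        {
            "DMLC_ROLE": "worker",
            "DMLC_WORKER_ID": "0",
            "DMLC_NUM_WORKER": "1",  # one worker machine only
            "NVIDIA_VISIBLE_DEVICES": visible_devices,
            "CUDA_VISIBLE_DEVICES": visible_devices,
        }
    )
    # Assumption: without this variable a single worker does not contact servers.
    env.pop("BYTEPS_FORCE_DISTRIBUTED", None)
    return env


def launch_command(script, model="resnet50", num_iters=200, batch_size=128):
    """Build the same bpslaunch command line that appears in the trace."""
    return [
        "bpslaunch", "python3", script,
        "--model", model,
        "--num-iters", str(num_iters),
        "--batch-size", str(batch_size),
    ]


if __name__ == "__main__":
    env = single_node_env(os.environ)
    cmd = launch_command("/mydata/lexu/byteps/example/scripts/benchmark_byteps.py")
    print(" ".join(cmd))
    # To actually launch, pass both to subprocess:
    #   import subprocess; subprocess.run(cmd, env=env)
```

If the segmentation fault in ncclGetUniqueId still appears with this environment, the network and multi-node configuration can be ruled out, and the NCCL build linked into the BytePS extension becomes the more likely place to look.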