PocketFlow

Can't run on multiple GPUs

Open liaocz opened this issue 7 years ago • 7 comments

When I use one GPU, training finishes without any problem, but with multiple GPUs it hangs while running the bcast operation, and I don't know how to solve it. Code: channel_pruning_gpu/learner.py:149

liaocz avatar Jan 07 '19 06:01 liaocz

Which environment are you using? Horovod in the native environment, or via a docker image?

jiaxiang-wu avatar Jan 07 '19 11:01 jiaxiang-wu

native environment

liaocz avatar Jan 07 '19 11:01 liaocz

You may need to check whether Horovod's network options are set properly ("eth1" parts), according to your native environment's network configuration.

  options="-np ${nb_gpus} -H localhost:${nb_gpus} -bind-to none -map-by slot
      -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth1 -x NCCL_IB_DISABLE=1
      -x LD_LIBRARY_PATH --mca btl_tcp_if_include eth1"
  mpirun ${options} python main.py --enbl_multi_gpu ${extra_args}
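
The interface name given to `NCCL_SOCKET_IFNAME` and `btl_tcp_if_include` has to exist on the machine. As a minimal sketch (not part of PocketFlow; the helper names `list_interfaces` and `build_mpirun_options` are hypothetical), the same options could be assembled in Python only after checking the name against the interfaces the OS reports:

```python
import socket


def list_interfaces():
    """Return the network interface names reported by the OS."""
    return [name for _, name in socket.if_nameindex()]


def build_mpirun_options(nb_gpus, ifname, available=None):
    """Assemble the single-node mpirun options shown above.

    Raises ValueError if `ifname` is not among the available interfaces.
    `available` can be passed explicitly; by default the OS is queried.
    """
    if available is None:
        available = list_interfaces()
    if ifname not in available:
        raise ValueError(
            "interface %r not found; available: %s"
            % (ifname, ", ".join(available))
        )
    return [
        "-np", str(nb_gpus),
        "-H", "localhost:%d" % nb_gpus,
        "-bind-to", "none",
        "-map-by", "slot",
        "-x", "NCCL_DEBUG=INFO",
        "-x", "NCCL_SOCKET_IFNAME=%s" % ifname,
        "-x", "NCCL_IB_DISABLE=1",
        "-x", "LD_LIBRARY_PATH",
        "--mca", "btl_tcp_if_include", ifname,
    ]
```

Calling `list_interfaces()` first shows which names are valid on the machine (for example `eth0` or `lo` instead of `eth1`); the returned list can then be placed between `mpirun` and `python main.py --enbl_multi_gpu`.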

jiaxiang-wu avatar Jan 07 '19 11:01 jiaxiang-wu

Thank you for your answer. I have exported the environment variable NCCL_SOCKET_IFNAME=eth1, but it does not work for me: it still hangs when I use 2 GPUs on one node. If I comment out the bcast, it continues running. Do you have any idea?

liaocz avatar Jan 10 '19 01:01 liaocz

Hi liaocz, could you paste the log file so we can help figure out the root cause?
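
When a process hangs silently, the log often stops without showing where it is stuck. One way to get more into the log, sketched here with Python's standard `faulthandler` module (the `hang_watchdog` helper is hypothetical and not part of PocketFlow or Horovod), is to arm a timer around the suspect call so that the stack of every thread is dumped if the call does not return in time:

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump all thread stacks every `timeout_s` seconds while the block runs.

    `file` must be backed by a real file descriptor; it defaults to stderr.
    The timer is cancelled as soon as the block finishes.
    """
    if file is None:
        file = sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        faulthandler.cancel_dump_traceback_later()
```

The broadcast step could then be wrapped as `with hang_watchdog(60): ...`, with the call from channel_pruning_gpu/learner.py:149 inside the block. Each rank is a separate process, so each one dumps its own stacks. This only shows the Python-level location of the hang, not what NCCL or MPI is waiting for.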

jinhou avatar Jan 17 '19 02:01 jinhou

  2019-01-17 11:30:31.451159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
  2019-01-17 11:30:31.451169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
  2019-01-17 11:30:31.451695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21546 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:83:00.0, compute capability: 6.1)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d/kernel:0 of size (7, 7, 3, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_1/kernel:0 of size (1, 1, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_2/kernel:0 of size (3, 3, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_3/kernel:0 of size (3, 3, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_4/kernel:0 of size (3, 3, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_5/kernel:0 of size (3, 3, 64, 64)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_6/kernel:0 of size (1, 1, 64, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_7/kernel:0 of size (3, 3, 64, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_8/kernel:0 of size (3, 3, 128, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_9/kernel:0 of size (3, 3, 128, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_10/kernel:0 of size (3, 3, 128, 128)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_11/kernel:0 of size (1, 1, 128, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_12/kernel:0 of size (3, 3, 128, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_13/kernel:0 of size (3, 3, 256, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_14/kernel:0 of size (3, 3, 256, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_15/kernel:0 of size (3, 3, 256, 256)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_16/kernel:0 of size (1, 1, 256, 512)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_17/kernel:0 of size (3, 3, 256, 512)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_18/kernel:0 of size (3, 3, 512, 512)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_19/kernel:0 of size (3, 3, 512, 512)
  INFO:tensorflow:creating a pruning mask for pruned_model/resnet_model/conv2d_20/kernel:0 of size (3, 3, 512, 512)
  2019-01-17 11:30:38.546495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
  2019-01-17 11:30:38.546564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
  2019-01-17 11:30:38.546575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
  2019-01-17 11:30:38.546582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
  2019-01-17 11:30:38.546796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21546 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:83:00.0, compute capability: 6.1)
  INFO:tensorflow:begin restoring model from checkpoint file
  INFO:tensorflow:/mnt/PocketFlow/pretrain_models/models
  INFO:tensorflow:/mnt/PocketFlow/pretrain_models/models/model.ckpt-250227
  INFO:tensorflow:Restoring parameters from /mnt/PocketFlow/pretrain_models/models/model.ckpt-250227
  INFO:tensorflow:finish restoring model from checkpoint file
  INFO:tensorflow:name: "group_deps"

It hangs right after it finishes restoring the checkpoint file.
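
Both "Created TensorFlow device" lines in the log name the same physical GPU (pci bus id 0000:83:00.0). The pasted log may well come from a single process, so this alone proves nothing, but if the output of each rank is saved separately it is cheap to check whether two ranks ended up on the same card. A rough sketch, assuming one log text per rank (the helper names are hypothetical, and whether GPU sharing is the cause here is only a guess):

```python
import re

# Matches the "Created TensorFlow device" lines as they appear in the log.
DEVICE_RE = re.compile(
    r"Created TensorFlow device \((?P<tf_device>\S+) with (?P<mem_mb>\d+) MB memory\)"
    r" -> physical GPU \(device: (?P<index>\d+), name: (?P<name>[^,]+),"
    r" pci bus id: (?P<pci>[0-9a-fA-F:.]+)"
)


def physical_gpus(log_text):
    """Return the set of PCI bus ids of the GPUs mentioned in one log."""
    return {m.group("pci") for m in DEVICE_RE.finditer(log_text)}


def shared_gpus(logs_by_rank):
    """Map each PCI bus id used by more than one rank to the sorted ranks.

    `logs_by_rank` maps a rank number to that rank's log text.
    """
    ranks_by_pci = {}
    for rank, text in logs_by_rank.items():
        for pci in physical_gpus(text):
            ranks_by_pci.setdefault(pci, []).append(rank)
    return {
        pci: sorted(ranks)
        for pci, ranks in ranks_by_pci.items()
        if len(ranks) > 1
    }
```

An empty result from `shared_gpus` means every rank reported a different card, in which case GPU assignment can be ruled out and the network settings remain the more likely suspect.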

liaocz avatar Jan 17 '19 03:01 liaocz

@liaocz We do not have a clue at the moment. This looks more like a Horovod-related issue. Maybe you can find some help here? https://github.com/uber/horovod

jiaxiang-wu avatar Jan 21 '19 03:01 jiaxiang-wu