
[Efficientdet/TF2] Training script on 4-GPU systems

vasanth1986 opened this issue 2 years ago

TensorFlow2/Detection/Efficientdet: Please help me modify train.py (or the other *.py files) so that training runs on a 4-GPU system.

The convergence script (convergence-AMP-8xA100-80G.sh) runs without any issue on an 8-GPU system. However, I get a "Missing ranks" warning followed by a Horovod internal error when I run the same script changed only by setting CUDA_VISIBLE_DEVICES=0,1,2,3 (4 GPUs).
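For context, Horovod-based training scripts like this one typically pin each worker process to the GPU that matches its local rank, so the number of launched processes has to match the number of GPUs made visible to the job; only masking CUDA_VISIBLE_DEVICES while the launcher still spawns 8 ranks would leave ranks 4-7 without a device, which is consistent with the stall warning and the ncclCommInitRank failure below. A minimal sketch of that standard Horovod/TF2 pinning pattern (assumed for illustration, not necessarily the repository's exact code in train.py / train_lib.py):

```python
# Minimal sketch of the usual Horovod + TF2 device-pinning pattern.
# Illustrative only; the actual setup in this repository may differ.
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Each rank restricts itself to the GPU matching its local rank, so the
# number of launched processes must equal the number of visible GPUs.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
```

If that is what the script does, the usual way to run on 4 GPUs would be to launch 4 processes (for example `horovodrun -np 4 python train.py ...`, or changing the process count in the mpirun/horovodrun line of the convergence script) rather than only setting CUDA_VISIBLE_DEVICES; the per-GPU batch size, learning rate, and epoch settings may also need rescaling to keep convergence comparable.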

Error Message:

[2023-06-27 10:46:24.525097: W /opt/tensorflow/horovod-source/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
1: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
2: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
Traceback (most recent call last):
  File "train.py", line 336, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train.py", line 231, in main
    history = model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0' defined at (most recent call last):
  File "train.py", line 336, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "train.py", line 231, in main
    history = model.fit(
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1384, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1021, in train_function
    return step_function(self, iterator)
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1010, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1000, in run_step
    outputs = model.train_step(data)
  File "/workspace/effdet-tf2/utils/train_lib.py", line 388, in train_step
    scaled_gradients = tape.gradient(scaled_loss, trainable_vars)
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 774, in gradient
    return self._allreduce_grads(gradients, sources)
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 413, in allreduce_grads
    return [_allreduce_cond(grad,
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 413, in <listcomp>
    return [_allreduce_cond(grad,
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 253, in _allreduce_cond
    return tf.cond(tf.logical_and(
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 248, in allreduce_fn
    return allreduce(tensor, *args, process_set=process_set, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/__init__.py", line 125, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed, op=op,
  File "/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_ops.py", line 123, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
  File "", line 107, in horovod_allreduce

Node: 'DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0'
ncclCommInitRank failed: internal error
  [[{{node DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0}}]] [Op:__inference_train_function_174717]

vasanth1986 · Jul 04 '23 14:07