[Efficientdet/TF2] Training script on 4-GPU systems
TensorFlow2/Detection/Efficientdet: Please help me modify train.py (or other *.py files) to run on a 4-GPU system.
The convergence script (convergence-AMP-8xA100-80G.sh) runs without any issue on an 8-GPU system. However, when I run the same script modified with CUDA_VISIBLE_DEVICES=0,1,2,3 (4 GPUs), I get "Missing ranks" warnings followed by a Horovod internal error.
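For reference, this is a sketch of the change I believe is needed: with Horovod, the number of launched ranks must match the number of visible GPUs, otherwise ranks block forever at allreduce. The mpirun wrapper below is an assumption about how the convergence script invokes train.py; the actual flags in convergence-AMP-8xA100-80G.sh may differ.

```shell
# Restrict training to the first four GPUs.
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Derive the rank count from the visible-device list instead of
# hard-coding 8, so -np always matches the GPU count.
NUM_GPU=$(awk -F',' '{print NF}' <<< "$CUDA_VISIBLE_DEVICES")
echo "launching $NUM_GPU Horovod ranks"

# Hypothetical launch line; replace the comment with the flags
# used in the original convergence script:
# mpirun -np "$NUM_GPU" --allow-run-as-root --bind-to none python3 train.py <original flags>
```

If the script still launches 8 ranks while only 4 GPUs are visible, some ranks share a device or never reach the collective, which would produce exactly the stall below.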
Error Message:
2023-06-27 10:46:24.525097: W /opt/tensorflow/horovod-source/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
1: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
2: [DistributedGradientTape_Allreduce/cond_212/then/_1696/DistributedGradientTape_Allreduce/cond_212/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_212_Cast_1_0, DistributedGradientTape_Allreduce/cond_213/then/_1704/DistributedGradientTape_Allreduce/cond_213/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_213_Cast_1_0, DistributedGradientTape_Allreduce/cond_214/then/_1712/DistributedGradientTape_Allreduce/cond_214/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_214_Cast_1_0, DistributedGradientTape_Allreduce/cond_215/then/_1720/DistributedGradientTape_Allreduce/cond_215/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_215_Cast_1_0, DistributedGradientTape_Allreduce/cond_216/then/_1728/DistributedGradientTape_Allreduce/cond_216/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_216_Cast_1_0, DistributedGradientTape_Allreduce/cond_217/then/_1736/DistributedGradientTape_Allreduce/cond_217/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_217_Cast_1_0 ...]
Traceback (most recent call last):
File "train.py", line 336, in
Detected at node 'DistributedGradientTape_Allreduce/cond_480/HorovodAllreduce_DistributedGradientTape_Allreduce_cond_480_Cast_1_0' defined at (most recent call last):
File "train.py", line 336, in