automl
MultiWorkerMirroredStrategy for distributed training not working on GPUs
Hi,
I am using MultiWorkerMirroredStrategy with tf.estimator.train_and_evaluate for distributed training over 3 epochs.
Please find below the information:
GPU: 4 x NVIDIA Tesla V100
Dataset: COCO
Model: EfficientDet-D5
TensorFlow: 2.4.0-gpu
Error when trying to train this model:
Bad status from CompleteGroupDistributed: Failed precondition: Device /job:worker/replica:0/task:1/device:GPU:0 current incarnation doesn't match with one in the group. This usually means this worker has restarted but the collective leader hasn't, or this worker connects to a wrong cluster.
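For reference, this "incarnation doesn't match" failure typically means the workers do not agree on the cluster: either one worker restarted while the collective leader kept its old group state, or the TF_CONFIG cluster specs differ between workers. A minimal sketch of building a consistent TF_CONFIG (hostnames and ports below are placeholders, not from this issue):

```python
import json
import os

# Hypothetical worker addresses -- replace with your machines.
CLUSTER = {"worker": ["host1:12345", "host2:12345"]}

def make_tf_config(task_index):
    """Build the TF_CONFIG value for one worker.

    Every worker must see the *same* cluster spec; only the task
    index differs per machine. A spec mismatch, or one worker
    restarting while the others keep running, produces the
    'incarnation doesn't match' error above.
    """
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": "worker", "index": task_index},
    })

# On worker 0 (the collective leader):
os.environ["TF_CONFIG"] = make_tf_config(0)
# On worker 1 the only difference is the task index:
# os.environ["TF_CONFIG"] = make_tf_config(1)

print(os.environ["TF_CONFIG"])
```

TensorFlow reads TF_CONFIG when MultiWorkerMirroredStrategy is constructed, so it must be set before the strategy is created in main.py.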
I have changed a few lines in the main.py file.


FYI: using train mode only.
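Since this error also fires when one worker restarts but the leader does not, a fix that often works is to kill and relaunch every worker together, each with an identical cluster spec and its own task index. A minimal launch sketch (hostnames, ports, and the --mode flag are placeholders; adjust them to your setup and script):

```shell
# On worker 0 (collective leader) -- same cluster spec on every machine:
export TF_CONFIG='{"cluster":{"worker":["host1:12345","host2:12345"]},"task":{"type":"worker","index":0}}'
python main.py --mode=train

# On worker 1 -- identical cluster spec, only the task index changes:
export TF_CONFIG='{"cluster":{"worker":["host1:12345","host2:12345"]},"task":{"type":"worker","index":1}}'
python main.py --mode=train
```

If any single worker dies, restart the whole group rather than just that worker, so all collectives rebuild with matching incarnations.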
Same error here. Has your problem been solved?