
MultiWorkerMirroredStrategy for distributed training not working on GPUs

Open · ankur47 opened this issue · 1 comment

Hi, I am using MultiWorkerMirroredStrategy and tf.estimator.train_and_evaluate for distributed training with 3 epochs. Please find the setup information below:

GPU: 4 x NVIDIA Tesla V100
Dataset: COCO
Model: EfficientDet-D5
TensorFlow: 2.4.0-gpu

Error when trying to train this model:

```
Bad status from CompleteGroupDistributed: Failed precondition: Device /job:worker/replica:0/task:1/device:GPU:0 current incarnation doesn't match with one in the group. This usually means this worker has restarted but the collective leader hasn't, or this worker connects to a wrong cluster.
```
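As the message itself suggests, this error typically means the workers disagree about the cluster: either one worker restarted while the collective leader did not, or the TF_CONFIG cluster specs are not identical across machines. Below is a minimal sketch of a consistent TF_CONFIG, assuming a hypothetical two-worker cluster (host names, ports, and indices here are placeholders, not from this issue):

```python
import json
import os

# Hypothetical two-worker cluster spec. The "cluster" section must be
# identical on every worker; only "task" differs (index 0 on the
# chief/leader, index 1 on the second worker). TF_CONFIG must be set
# before the strategy is created.
tf_config = {
    "cluster": {
        "worker": ["host0.example.com:12345", "host1.example.com:12345"]
    },
    "task": {"type": "worker", "index": 0}  # set the index per machine
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```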

I have changed a few lines in the main.py file (see the screenshots below; a sketch of the typical wiring follows them):

[Two screenshots of the modified main.py lines; content not recoverable.]
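Since the actual diff was attached only as screenshots, here is a minimal sketch of how MultiWorkerMirroredStrategy is commonly wired into an Estimator-based main.py via `RunConfig(train_distribute=...)`. The `model_fn`, `train_input_fn`, and `model_dir` below are hypothetical stand-ins for EfficientDet's own, not the poster's code:

```python
import tensorflow as tf

def train_input_fn():
    # Hypothetical stand-in for EfficientDet's COCO dataloader.
    ds = tf.data.Dataset.from_tensors(({"x": [[1.0]]}, [[1.0]]))
    return ds.repeat()

def model_fn(features, labels, mode):
    # Hypothetical stand-in for the EfficientDet model_fn.
    preds = tf.keras.layers.Dense(1)(features["x"])
    loss = tf.reduce_mean(tf.square(preds - labels))
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.compat.v1.train.GradientDescentOptimizer(0.01)
        train_op = optimizer.minimize(
            loss, global_step=tf.compat.v1.train.get_or_create_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)

# Stable API as of TF 2.4; older versions use
# tf.distribute.experimental.MultiWorkerMirroredStrategy.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

run_config = tf.estimator.RunConfig(
    train_distribute=strategy,   # replicate training across workers/GPUs
    model_dir="/tmp/effdet_d5",  # hypothetical checkpoint directory
)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=train_input_fn, steps=10)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

Note that TF_CONFIG (see above) has to be exported on every machine before the strategy object is constructed, or the workers will not form a consistent collective group.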

FYI: I am using train mode only.

ankur47 · Mar 15 '21 14:03

Same error here. Has your problem been solved?

DirkFi · Jan 05 '22 06:01