GPUs strategy: slow model initialization
Hi,
Thanks for this amazing repo.
I successfully trained a model on custom data and achieved great results.
But when I try training on 2 GPUs using --strategy=gpus I run into a problem.
It takes about 15 minutes for the graph to be loaded onto the GPUs (i.e. until I see high memory usage in nvidia-smi).
After that, the code gets stuck after printing these lines:
```
I1029 11:19:30.220527 140098773014272 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1029 11:19:36.538457 140098773014272 session_manager.py:508] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
I1029 11:22:55.363960 140098773014272 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/model_dir/efficientdet-d3-codeless_4batch_180epoch_4multigpu/model.ckpt.
I1029 11:22:55.365018 140098773014272 basic_session_run_hooks.py:618] Saving checkpoints for 0 into /tmp/model_dir/efficientdet-d3-codeless_4batch_180epoch_4multigpu/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
I1029 11:23:37.064225 140098773014272 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
```
This is the command I run:
```shell
python main.py --mode=train \
  --training_file_pattern={folder_records}/{train_pattern} \
  --model_name={MODEL} \
  --model_dir=/tmp/model_dir/{MODEL}-codeless_8atch_180epoch_2multigpu \
  --train_batch_size=8 \
  --num_examples_per_epoch=3168 \
  --num_epochs=180 \
  --hparams=data/codeless_config.yaml \
  --strategy=gpus
```
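For reference, one sanity check worth trying before launching is to make sure only the two intended GPUs are visible to TensorFlow (`CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable; the device indices `0,1` are an assumption for a 2-GPU machine):

```shell
# Pin the two GPUs before launching main.py, so the gpus strategy
# only enumerates the devices actually intended for training.
export CUDA_VISIBLE_DEVICES=0,1
```

This rules out the strategy trying to place replicas on every device on the machine.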
Has anybody hit this issue and has advice, please? Thanks.
This is the parameters config file:
```yaml
num_classes: 99
anchor_scale: 1.0
label_map: {*****, dict of 98 classes'}
```
Same issue. Have you solved it?