
GPUs Strategy: Slow Model Initialization

Open matankley opened this issue 5 years ago • 2 comments

Hi, thanks for this amazing repo. I successfully trained a model on custom data and achieved great results, but when I try training on 2 GPUs using --strategy=gpus I run into a problem. It takes about 15 minutes for the graph to be loaded onto the GPUs (until I see high memory usage in nvidia-smi). Afterwards the code gets stuck after printing these lines:

    I1029 11:19:30.220527 140098773014272 session_manager.py:505] Running local_init_op.
    INFO:tensorflow:Done running local_init_op.
    I1029 11:19:36.538457 140098773014272 session_manager.py:508] Done running local_init_op.
    INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
    I1029 11:22:55.363960 140098773014272 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
    INFO:tensorflow:Saving checkpoints for 0 into /tmp/model_dir/efficientdet-d3-codeless_4batch_180epoch_4multigpu/model.ckpt.
    I1029 11:22:55.365018 140098773014272 basic_session_run_hooks.py:618] Saving checkpoints for 0 into /tmp/model_dir/efficientdet-d3-codeless_4batch_180epoch_4multigpu/model.ckpt.
    INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
    I1029 11:23:37.064225 140098773014272 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
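To see where the time goes between those log lines, the timestamps can be diffed. This is only a minimal sketch and not part of the repo: `parse_glog` and `gaps` are hypothetical helper names, the regex assumes the `I<MMDD> <time> <thread> <file>:<line>] <message>` prefix seen in the output above, and the year is a placeholder because the log prefix does not carry one.

```python
import re
from datetime import datetime

# Prefix as seen in the log above: I<MMDD> <HH:MM:SS.ffffff> <thread> <file>:<line>] <msg>
GLOG_RE = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) \d+ ([^\]]+)\] (.*)$"
)

# Lines copied from the log above. The INFO:tensorflow duplicates carry no
# timestamp, so the parser skips them.
SAMPLE_LOG = """\
I1029 11:19:30.220527 140098773014272 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1029 11:19:36.538457 140098773014272 session_manager.py:508] Done running local_init_op.
I1029 11:22:55.363960 140098773014272 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
I1029 11:22:55.365018 140098773014272 basic_session_run_hooks.py:618] Saving checkpoints for 0 into /tmp/model_dir/efficientdet-d3-codeless_4batch_180epoch_4multigpu/model.ckpt.
I1029 11:23:37.064225 140098773014272 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
"""


def parse_glog(lines):
    """Return (timestamp, source_location, message) for each timestamped line."""
    events = []
    for line in lines:
        match = GLOG_RE.match(line.strip())
        if match is None:
            continue  # not a timestamped line
        mmdd, clock, source, message = match.groups()
        # Placeholder year 2000 (assumption): only the differences matter here.
        stamp = datetime.strptime(f"2000{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
        events.append((stamp, source, message))
    return events


def gaps(events):
    """Return (seconds since previous event, message) for consecutive events."""
    return [
        ((later[0] - earlier[0]).total_seconds(), later[2])
        for earlier, later in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for seconds, message in gaps(parse_glog(SAMPLE_LOG.splitlines())):
        print(f"{seconds:10.3f}s before: {message}")
```

On the lines above, the largest gap is the one between `Done running local_init_op.` and the first checkpoint listener call, followed by the checkpoint save itself.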

This is the command I run:

    python main.py --mode=train \
      --training_file_pattern={folder_records}/{train_pattern} \
      --model_name={MODEL} \
      --model_dir=/tmp/model_dir/{MODEL}-codeless_8atch_180epoch_2multigpu \
      --train_batch_size=8 \
      --num_examples_per_epoch=3168 --num_epochs=180 \
      --hparams=data/codeless_config.yaml \
      --strategy=gpus

Has anybody run into this issue and have any advice? Thanks.

matankley avatar Oct 29 '20 11:10 matankley

This is the parameters config file:

    num_classes: 99
    anchor_scale: 1.0
    label_map: {*****, dict of 98 classes}

matankley avatar Oct 29 '20 11:10 matankley

Same issue. Have you solved it?

MinZhangm avatar Mar 09 '23 03:03 MinZhangm