ORQA ICT training seems to stop at evaluation
Hi guys, I was running this on CentOS with TF 2.1 and 4 V100 GPUs:
python -m language.orqa.experiments.ict_experiment --model_dir=/home/wjd/workspace/Reference_Repority/language_model_master/language/orqa/ict_model --bert_hub_module_path=/home/wjd/workspace/Reference_Repority/language_model_master/language/orqa/bert_tf_hub/bert_uncased_L-12_H-768_A-12_1 --examples_path=/home/wjd/workspace/Reference_Repority/language_model_master/wikipedia_base/examples.tfr --batch_size=32 --num_train_steps=100000 --use_tpu=False --save_checkpoints_steps=100
It produced the following logs:
Instructions for updating: Use standard file utilities to get mtimes.
W0928 17:19:43.754457 139878239135552 deprecation.py:323] From /home/wuxibio/workspace/yes/envs/tf2.1/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py:1069: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating: Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0928 17:19:44.931515 139878239135552 session_manager.py:504] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0928 17:19:45.473590 139878239135552 session_manager.py:507] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 100 into /home/wuxibio/workspace/Reference_Repority/language_model_master/language/orqa/ict_model/model.ckpt.
I0928 17:19:59.798972 139878239135552 basic_session_run_hooks.py:613] Saving checkpoints for 100 into /home/wuxibio/workspace/Reference_Repority/language_model_master/language/orqa/ict_model/model.ckpt.
INFO:tensorflow:loss = 11.514843, step = 100
I0928 17:20:21.271094 139878239135552 basic_session_run_hooks.py:262] loss = 11.514843, step = 100
INFO:tensorflow:Saving checkpoints for 200 into /home/wuxibio/workspace/Reference_Repority/language_model_master/language/orqa/ict_model/model.ckpt.
I0928 17:22:00.624361 139878239135552 basic_session_run_hooks.py:613] Saving checkpoints for 200 into /home/wuxibio/workspace/Reference_Repority/language_model_master/language/orqa/ict_model/model.ckpt.
INFO:tensorflow:Concurrent reads of 14000000 records: [ 758683 1205631 3034394 3654102 4839897 6913069 7508564 9310926 9337631 10880908 12750259 13726628]
I0928 17:22:03.979343 139878239135552 ict_dataset.py:110] Concurrent reads of 14000000 records: [ 758683 1205631 3034394 3654102 4839897 6913069 7508564 9310926 9337631 10880908 12750259 13726628]
I0928 17:22:03.996259 139878239135552 registry.py:47] resolver HttpCompressedFileResolver does not support the provided handle.
I0928 17:22:03.996369 139878239135552 registry.py:47] resolver GcsCompressedFileResolver does not support the provided handle.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0928 17:22:05.344879 139878239135552 saver.py:1503] Saver not created because there are no variables in the graph to restore
{'do_lower_case': True, 'vocab_file': b'/home/wuxibio/workspace/Reference_Repority/language_model_master/language/orqa/bert_tf_hub/bert_uncased_L-12_H-768_A-12_1/assets/vocab.txt'}
INFO:tensorflow:Calling model_fn.
I0928 17:22:05.836602 139878239135552 estimator.py:1151] Calling model_fn.
INFO:tensorflow:Model batch size: 32
I0928 17:22:05.837093 139878239135552 ict_model.py:97] Model batch size: 32
I0928 17:22:05.838134 139878239135552 registry.py:47] resolver HttpCompressedFileResolver does not support the provided handle.
I0928 17:22:05.838222 139878239135552 registry.py:47] resolver GcsCompressedFileResolver does not support the provided handle.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0928 17:22:07.403385 139878239135552 saver.py:1503] Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0928 17:22:07.912423 139878239135552 saver.py:1503] Saver not created because there are no variables in the graph to restore
I0928 17:22:08.796733 139878239135552 registry.py:47] resolver HttpCompressedFileResolver does not support the provided handle.
I0928 17:22:08.796917 139878239135552 registry.py:47] resolver GcsCompressedFileResolver does not support the provided handle.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0928 17:22:10.025779 139878239135552 saver.py:1503] Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0928 17:22:10.457150 139878239135552 saver.py:1503] Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0928 17:22:14.178668 139878239135552 saver.py:1503] Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0928 17:22:15.833999 139878239135552 saver.py:1503] Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Global batch size: 32
I0928 17:22:16.123959 139878239135552 ict_model.py:127] Global batch size: 32
/home/wuxibio/workspace/yes/envs/tf2.1/lib/python3.6/site-packages/tensorflow_core/python/framework/indexed_slices.py:433: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
I0928 17:22:20.287163 139878239135552 estimator.py:1153] Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-09-28T17:22:20Z
I0928 17:22:20.297024 139878239135552 evaluation.py:255] Starting evaluation at 2020-09-28T17:22:20Z
INFO:tensorflow:Graph was finalized.
I0928 17:22:20.971659 139878239135552 monitored_session.py:246] Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/wuxibio/workspace/Reference_Repority/language_model_master/language/orqa/ict_model/model.ckpt-200
I0928 17:22:20.997153 139878239135552 saver.py:1284] Restoring parameters from /home/wuxibio/workspace/Reference_Repority/language_model_master/language/orqa/ict_model/model.ckpt-200
INFO:tensorflow:Running local_init_op.
I0928 17:22:25.253638 139878239135552 session_manager.py:504] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0928 17:22:25.762617 139878239135552 session_manager.py:507] Done running local_init_op.
It stopped there, right after the evaluation started, and nothing happened after that.
I have the same issue. Did you manage to resolve it?