Checkpoint file cannot be found error on Kubernetes cluster

Open shouhong opened this issue 9 years ago • 4 comments

I followed the instructions at https://github.com/tensorflow/ecosystem/tree/master/kubernetes to run the mnist sample on Kubernetes with 2 workers and 2 ps, and got the errors below. Do you have any idea what the possible reason is? Thanks!

  1. Error on ps-1: Not found: /train_dir/model.ckpt-0_temp_5fcd97c881a7428db8eded829b964618/part-00000-of-00002.index

  2. Error on worker-0: NotFoundError (see above for traceback): /train_dir/model.ckpt-0_temp_5fcd97c881a7428db8eded829b964618/part-00000-of-00002.index [[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:ps/replica:0/task:1/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _recv_save/Const_0_S81)]]

The mnist.py I used is from https://github.com/tensorflow/ecosystem/tree/master/docker

shouhong avatar Jan 12 '17 09:01 shouhong

I have the same problem.

drinktee avatar Mar 15 '17 10:03 drinktee

Are you setting the train_dir to a local directory? It must be a directory visible to all workers.

jhseu avatar Mar 15 '17 18:03 jhseu
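One way to test the "visible to all workers" condition is to have every task drop a marker file into train_dir and then check which markers each task can see. The following is a minimal sketch, not code from the thread or from mnist.py: the helper names `write_marker` and `missing_markers` and the `visible-` file prefix are hypothetical, and the task names follow the ps-N / worker-N pattern from the error messages above.

```python
import os
import tempfile


def write_marker(train_dir, task_name):
    """Create an empty marker file for this task inside train_dir."""
    os.makedirs(train_dir, exist_ok=True)
    path = os.path.join(train_dir, "visible-" + task_name)
    with open(path, "w"):
        pass  # the file only needs to exist
    return path


def missing_markers(train_dir, expected_tasks):
    """Return the tasks whose marker files this process cannot see."""
    present = set(os.listdir(train_dir)) if os.path.isdir(train_dir) else set()
    return sorted(t for t in expected_tasks if "visible-" + t not in present)


if __name__ == "__main__":
    # Single-process demo: only worker-0 writes a marker, so ps-0 is reported missing.
    demo_dir = tempfile.mkdtemp()
    write_marker(demo_dir, "worker-0")
    print(missing_markers(demo_dir, ["ps-0", "worker-0"]))
```

If each pod calls `write_marker` at startup and a later `missing_markers` call on any pod still reports other tasks, train_dir is most likely a pod-local directory rather than a shared volume.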

> Are you setting the train_dir to a local directory? It must be a directory visible to all workers.

Yes, I set the train_dir to a local directory. Let me try it again...

drinktee avatar Mar 17 '17 02:03 drinktee

I have the same problem, but I am using Google Colab with Google Drive mounted through google-drive-ocamlfuse, not a distributed setup. How can I solve this problem? It occurs with some models, so maybe it depends on the model structure.

Is there any way to change the temporary directory used by the checkpoint saver?

aligokalppeker avatar May 24 '18 20:05 aligokalppeker
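The thread does not confirm a way to redirect the saver's temporary directory. The paths in the errors above suggest it is created next to the checkpoint prefix (`/train_dir/model.ckpt-0_temp_...`), so one possible workaround, assuming the failure comes from the FUSE mount rather than from the model, is to save to a local directory first and copy the finished files to the mounted drive afterwards. The sketch below covers only the copy step; `copy_checkpoint` is a hypothetical helper, not a TensorFlow API.

```python
import glob
import os
import shutil


def copy_checkpoint(local_prefix, dest_dir):
    """Copy every regular file whose path starts with local_prefix into dest_dir.

    Leftover "<prefix>_temp_..." directories are skipped because only files are copied.
    Note that a prefix ending in "-1" would also match files for step 10, 11, and so on.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(glob.escape(local_prefix) + "*")):
        if os.path.isfile(src):
            shutil.copy2(src, dest_dir)
            copied.append(os.path.basename(src))
    return copied
```

After a save to a local prefix such as `/tmp/train_dir/model.ckpt-0`, calling `copy_checkpoint` with that prefix and a directory on the mounted drive would transfer the finished checkpoint files. The `checkpoint` state file that sits alongside them does not share the prefix and would need to be copied separately.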