Multiprocessor Training Resulted in Corrupted best_model.zip

Open winthropharvey opened this issue 4 years ago • 0 comments

I left a model training overnight with:

docker-compose exec app mpirun -np 8 python3 train.py -e connect4

But after just an hour or so, it crashed with the error:

A load persistent id instruction was encountered, but no persistent_load function was specified.

Subsequently, I could not restart training: whenever the program attempted to load best_model.zip, it produced the same error. Investigation revealed that the best_model.zip file had somehow become malformed/corrupted. I had to replace it with a previously saved model in order to resume training.
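
For anyone hitting the same thing, here is a minimal sketch of how the archive could be sanity-checked before a load is attempted. It is not taken from train.py; the helper name `is_valid_zip` is hypothetical, and it uses only Python's standard `zipfile` module. It checks archive-level integrity only (zip structure and per-member CRCs), so it would not catch a well-formed archive whose contents are wrong, and it says nothing about what caused the corruption.

```python
import zipfile
import zlib
from pathlib import Path


def is_valid_zip(path):
    """Return True if `path` is a readable zip whose members pass their CRC checks.

    Hypothetical helper for illustration; not part of the project's code.
    """
    path = Path(path)
    # A missing file or a file without a zip end-of-central-directory record fails here.
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the name of the first
            # one with a bad CRC or header, or None if all members are intact.
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error, OSError):
        return False
```

A check like `is_valid_zip("best_model.zip")` could be run before resuming, falling back to an earlier saved model when it returns False, which is what I ended up doing by hand.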

winthropharvey, Sep 12 '21 16:09