Resume training
Hi, I am trying to resume a training run, and I think this is done with the --restore parameter? But when I try it, I get an error message saying that a file ending in ...tune_metadata was not found. And indeed, there is no file with this ending in my checkpoints. What is the best way to resume experiments?
What's the exact command that you're running on restore? Can you try adding / to the end of the restore string like --restore=".../". I think omitting the / might cause an error even if .tune_metadata exists.
Regarding the missing .tune_metadata, do your trial directories still have checkpoint directories (i.e. directories of the form checkpoint_X) in them? If not, then maybe you didn't have checkpointing enabled when running your algorithm. To enable it, you should have --checkpoint-frequency set to > 0 or --checkpoint-at-end set to true.
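If it helps, here is a minimal sketch for checking a trial directory by hand. It only relies on what is mentioned above (directories named checkpoint_X and a file ending in .tune_metadata). The function name find_checkpoints is made up for this example, and it assumes the metadata file sits directly inside each checkpoint_X directory, which may not match your layout.

```python
import os
import re

# Matches directory names of the form checkpoint_X, where X is a number.
CHECKPOINT_DIR_PATTERN = re.compile(r"^checkpoint_(\d+)$")


def find_checkpoints(trial_dir):
    """Return (path, has_metadata) pairs, sorted by checkpoint number.

    Hypothetical helper, not part of softlearning. Assumes that a file
    ending in `.tune_metadata` lives directly inside each checkpoint_X
    directory.
    """
    found = []
    for name in os.listdir(trial_dir):
        match = CHECKPOINT_DIR_PATTERN.match(name)
        path = os.path.join(trial_dir, name)
        if match is None or not os.path.isdir(path):
            continue
        has_metadata = any(
            entry.endswith(".tune_metadata") for entry in os.listdir(path)
        )
        found.append((int(match.group(1)), path, has_metadata))
    found.sort()  # numeric order, so checkpoint_10 comes after checkpoint_2
    return [(path, has_metadata) for _, path, has_metadata in found]
```

An empty result suggests checkpointing was never enabled for that trial. An entry with has_metadata set to False points at an incomplete checkpoint, or at a layout that differs from the assumption above.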
Hello, adding / to the end makes restoring work, and the checkpoints are there too. The problem I'm facing now is that the environment is not set up correctly. I have the impression that env.reset() is not called at the beginning, which causes problems when step() is called right away.
Can you paste the error you're seeing here? It might help me figure out what the problem is.
The error is specific to my environment. It is caused because reset() is not called initially...
Makes sense! I'll close this issue for now. Feel free to open it up again if you think it's an issue in softlearning.
Yes, I think it's an issue in softlearning. Shouldn't calling reset on the environment be the first thing to do after restoring?
Ah, I see now. The problem probably happens when we save the sampler while self._current_observation is not None and then try to resume. In that case the environment does not get reset at the beginning of training. I'll write a fix for this soon.
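To illustrate the failure mode described above, here is a toy sketch. It is not the softlearning sampler and not the actual fix: ToyEnv, ToySampler, and restore_from_state are names invented for this example, and only the _current_observation attribute comes from the comment above. The idea is that a sampler resets the environment only when it has no current observation, so a stale observation carried through a checkpoint skips the reset. Clearing it in __getstate__ is one possible way to avoid that.

```python
class ToyEnv:
    """Stand-in environment that fails if step() precedes reset()."""

    def __init__(self):
        self._ready = False

    def reset(self):
        self._ready = True
        return 0  # initial observation

    def step(self, action):
        if not self._ready:
            raise RuntimeError("step() called before reset()")
        return 1, 0.0, False, {}  # observation, reward, done, info


class ToySampler:
    """Hypothetical sampler that resets the env only when it has no observation."""

    def __init__(self):
        self.env = None
        self._current_observation = None

    def initialize(self, env):
        self.env = env

    def sample(self):
        if self._current_observation is None:
            self._current_observation = self.env.reset()
        next_observation, reward, done, info = self.env.step(0)
        self._current_observation = None if done else next_observation
        return next_observation

    def __getstate__(self):
        # Drop the env and the stale observation from the saved state, so a
        # restored sampler starts with a reset. pickle would call this method
        # when the sampler is saved.
        state = self.__dict__.copy()
        state["env"] = None
        state["_current_observation"] = None
        return state


def restore_from_state(state):
    """Rebuild a sampler from a saved state dict (stands in for unpickling)."""
    sampler = ToySampler.__new__(ToySampler)
    sampler.__dict__.update(state)
    return sampler


# Usage: save mid-episode, then resume against a freshly created environment.
sampler = ToySampler()
sampler.initialize(ToyEnv())
sampler.sample()  # leaves _current_observation set

restored = restore_from_state(sampler.__getstate__())
restored.initialize(ToyEnv())
restored.sample()  # resets first, because the saved observation was cleared
```

Without the line that clears _current_observation in __getstate__, the restored sampler would call step() on the fresh environment directly and hit the same kind of error reported here.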