
Resume training

Open kapsl opened this issue 6 years ago • 7 comments

Hi, I am trying to resume a training run, and I think this works via the --restore parameter? But when I try this, I get an error message saying that a file ending in ...tune_metadata was not found. And indeed, there is no file with this ending in my checkpoints. What is the best way to resume experiments?

kapsl avatar Apr 03 '19 07:04 kapsl

What's the exact command that you're running on restore? Can you try adding / to the end of the restore string like --restore=".../". I think omitting the / might cause an error even if .tune_metadata exists.

Regarding the missing .tune_metadata, do your trial directories still have checkpoint directories (i.e. directories of the form checkpoint_X) in them? If not, then maybe you didn't have checkpointing enabled when running your algorithm. To enable it, you should have --checkpoint-frequency set to > 1 or --checkpoint-at-end set to true.
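One way to check this quickly is sketched below. It is only an illustration and not part of softlearning: the helper name `summarize_checkpoints` is hypothetical, and it assumes the metadata file sits directly inside each checkpoint_X directory, which may not match every setup.

```python
from pathlib import Path


def summarize_checkpoints(trial_dir):
    """Map each checkpoint_X directory name to whether it holds a *.tune_metadata file.

    Hypothetical helper: assumes the metadata file is stored directly
    inside the checkpoint directory (an assumption, not a documented layout).
    """
    summary = {}
    for checkpoint_dir in sorted(Path(trial_dir).glob("checkpoint_*")):
        if not checkpoint_dir.is_dir():
            continue
        summary[checkpoint_dir.name] = any(
            path.is_file() and path.name.endswith(".tune_metadata")
            for path in checkpoint_dir.iterdir()
        )
    return summary
```

An empty result would suggest that checkpointing was never enabled for the trial, while a `False` entry would point at a checkpoint directory without its metadata file.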

hartikainen avatar Apr 04 '19 02:04 hartikainen

Hello, adding / to the end helps with restoring, and there are also checkpoints. But the problem I'm facing now is that the environment does not get set up correctly. I have the impression that env.reset() is not called at the beginning, which causes problems when directly trying a step.

kapsl avatar Apr 04 '19 09:04 kapsl

Can you paste the error you're seeing here? It might help me figure out what the problem is.

hartikainen avatar Apr 05 '19 00:04 hartikainen

The error is specific to my environment. It is caused because reset() is not called initially...

kapsl avatar Apr 05 '19 04:04 kapsl

Makes sense! I'll close this issue for now. Feel free to open it up again if you think it's an issue in softlearning.

hartikainen avatar Apr 05 '19 19:04 hartikainen

Yes, I think it's an issue in softlearning. Shouldn't calling reset on the environment be the first thing to do after restoring?

kapsl avatar Apr 06 '19 07:04 kapsl

Ah, I see now. The problem probably happens when we save the sampler while self._current_observation is not None and then try to resume. In this case the environment would not get reset at the beginning of training. I'll write a fix for this soon.
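The failure mode can be reproduced with a toy example. This is a minimal sketch, not the actual softlearning sampler: `ToyEnv`, `ToySampler`, `FixedSampler`, and `save_and_restore` are all hypothetical names, `copy.deepcopy` stands in for the real checkpoint round trip (it applies the same `__getstate__` hook that pickling would), and clearing the observation on save is just one possible fix.

```python
import copy


class ToyEnv:
    """Hypothetical environment that fails if step() is called before reset()."""

    def __init__(self):
        self._ready = False
        self._t = 0

    def reset(self):
        self._ready = True
        self._t = 0
        return 0

    def step(self, action):
        if not self._ready:
            raise RuntimeError("step() called before reset()")
        self._t += 1
        return self._t


class ToySampler:
    """Hypothetical sampler that resets the env only when it holds no observation."""

    def __init__(self, env):
        self.env = env
        self._current_observation = None

    def sample(self):
        if self._current_observation is None:
            self._current_observation = self.env.reset()
        self._current_observation = self.env.step(action=0)
        return self._current_observation

    def __getstate__(self):
        # The env is not saved; a fresh one is attached on restore.
        state = self.__dict__.copy()
        state["env"] = None
        return state


class FixedSampler(ToySampler):
    """Same sampler, but the stale observation is dropped when saving."""

    def __getstate__(self):
        state = super().__getstate__()
        state["_current_observation"] = None
        return state


def save_and_restore(sampler):
    # deepcopy stands in for saving and loading a checkpoint.
    restored = copy.deepcopy(sampler)
    restored.env = ToyEnv()  # fresh env that has not been reset yet
    return restored
```

With `ToySampler`, sampling once, restoring, and sampling again raises the `RuntimeError`, because the restored sampler still holds an observation and skips `reset()`. With `FixedSampler`, the restored sampler sees `None` and resets the fresh environment first.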

hartikainen avatar Apr 06 '19 19:04 hartikainen