Restore does not work at all anymore
Hi, since I updated the code recently, adding --restore=<path to checkpoint> no longer seems to do anything...
I still have this issue. I tried debugging it, but with no success so far. It seems to be wrapped pretty deeply into ray?!
I finally got it running, but this is probably not the ideal solution:
Give the program argument --restore="path to the experiment (not the tensorboard checkpoint)"
In examples.instrument.py, line 218, insert:
    resume = False
    if 'restore' in experiment_kwargs:
        resume = True
Below that, in the tune.run function call, pass
    resume=resume)
as the last argument.
In ray/tune/tune.py, line 202, insert:
    if resume:
        checkpoint_dir = restore
In rl_algorithm.py, line 241, insert:
    training_environment.reset()
Without it we can run into problems, because the first action taken is a step action.
And finally, you need to look in your experiment folder at the experiment-state.json files and, if necessary, set status from ERROR to RUNNING.
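The last step (flipping status from ERROR to RUNNING) could be scripted instead of edited by hand. This is only a rough sketch under assumptions: the function names are made up for illustration, the glob pattern for the state files is a guess, and it assumes the status is stored as a plain "status" key somewhere in the JSON. If your Ray version stores trial state in a different form, this will simply change nothing. Back up the files first.

```python
import json
from pathlib import Path


def reset_error_status(node):
    """Recursively replace "status": "ERROR" with "RUNNING"; return the count."""
    changed = 0
    if isinstance(node, dict):
        if node.get("status") == "ERROR":
            node["status"] = "RUNNING"
            changed += 1
        for value in node.values():
            changed += reset_error_status(value)
    elif isinstance(node, list):
        for item in node:
            changed += reset_error_status(item)
    return changed


def patch_experiment_states(experiment_dir):
    """Patch state files directly inside experiment_dir (hypothetical helper).

    The glob pattern is an assumption about how the files are named.
    """
    total = 0
    pattern = "experiment*state*.json"
    for path in sorted(Path(experiment_dir).expanduser().glob(pattern)):
        state = json.loads(path.read_text())
        changed = reset_error_status(state)
        if changed:
            # Only rewrite files in which something actually changed.
            path.write_text(json.dumps(state, indent=2))
        total += changed
    return total
```

Calling patch_experiment_states on the experiment folder returns how many status entries were changed, so a result of 0 tells you nothing matched.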
Hey @kapsl, sorry for the delayed response. I tried replicating this issue but I'm able to resume trials on my end. Here's what I run:
softlearning launch_example_debug examples.development \
--universe=gym \
--domain=Swimmer \
--task=v3 \
--exp-name="checkpoint-test-1" \
--checkpoint-frequency=1
The above command saves the trial results into ~/ray_results/gym/Swimmer/v3/2019-06-06T09-47-54-checkpoint-test-1/id=36f74991-seed=4411_2019-06-06_09-47-541swaqw7v/
softlearning launch_example_debug examples.development \
--universe=gym \
--domain=Swimmer \
--task=v3 \
--exp-name="checkpoint-test-1" \
--num-samples=4 \
--trial-cpus=2 \
--trial-gpus=0.25 \
--checkpoint-frequency=1 \
--checkpoint-at-end=False \
--video-save-frequency=25 \
--with-server=False \
--restore=~/ray_results/gym/Swimmer/v3/2019-06-06T09-47-54-checkpoint-test-1/id=36f74991-seed=4411_2019-06-06_09-47-541swaqw7v/checkpoint_1/
Could you try running a similar minimal example and see if that works? I'm happy to help you debug this if you provide more information or (preferably) a minimal example that reproduces it.
Hi, for me that actually doesn't work. It starts with this: if I provide a concrete checkpoint like checkpoint_600 with --restore, it doesn't find any experiment, because it tries to find the .json file.
I also recently had the strange situation that my code changes made restore work, but sometimes it just doesn't restore. It says it restores from the folder, with no error message etc., but then it just starts training from 0.
Ok, I finally found the problem:
- I reverted all my changes and, as you said, it is working.
- The problem is that the exp-name of the trial and of the restore run must match exactly!
- When you start the experiment, it complains that it has not found a checkpoint file and starts a new experiment. But that is not true: it is actually continuing the old one, just creating a new results folder...
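A quick sanity check before launching might catch this mismatch early. The sketch below is not part of softlearning; exp_name_matches_restore is a hypothetical helper, and it assumes the experiment directory is named either exactly like the exp-name or as a timestamp followed by "-" and the exp-name, as in the ~/ray_results/... path earlier in this thread.

```python
from pathlib import PurePosixPath


def exp_name_matches_restore(exp_name, restore_path):
    """Return True if some directory in restore_path is named after exp_name.

    Assumption: the experiment directory is either exactly exp_name or
    ends with "-" + exp_name (timestamp prefix).
    """
    suffix = "-" + exp_name
    return any(
        part == exp_name or part.endswith(suffix)
        for part in PurePosixPath(restore_path).parts
    )
```

For example, with --exp-name="checkpoint-test-1" and the restore path from the earlier command the check passes, while a different name such as "checkpoint-test-2" fails it.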