softlearning Restore does not work at all anymore

Hi, since I recently updated the code, passing --restore=<path to checkpoint> no longer seems to do anything...

Apr 30 '19 08:04 kapsl

I still have this issue. I tried debugging it, but with no success so far. The restore logic seems to be wrapped pretty deeply inside ray.

Jun 04 '19 13:06 kapsl

I finally got it running, but this is probably not the ideal solution:

Pass the program argument --restore="path to the experiment (not the tensorboard checkpoint)".

In examples/instrument.py, line 218, insert:

    resume = False
    if 'restore' in experiment_kwargs:
        resume = True

Further down, in the call to tune.run, pass resume=resume as the last argument.

In ray/tune/tune.py, line 202, insert:

    if resume:
        checkpoint_dir = restore

In rl_algorithm.py, line 241, insert training_environment.reset(). Without it we run into problems sooner or later, because the first thing done after restoring is a step call on an environment that has not been reset.

Finally, look in your experiment folder at the experiment-state.json files and, if necessary, set status from ERROR to RUNNING.
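
The last step (flipping status from ERROR to RUNNING) can be scripted rather than edited by hand. This is only a rough sketch: the helper names, the glob pattern, and the assumption that the state files are plain nested JSON with "status" keys are mine and not taken from ray, so check one of your own files first and keep a backup.

```python
import json
from pathlib import Path


def reset_error_status(node, old="ERROR", new="RUNNING"):
    """Recursively replace every "status" value equal to `old` with `new`.

    Returns the number of values that were replaced.
    """
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "status" and value == old:
                node[key] = new  # overwriting an existing key is safe while iterating
                changed += 1
            else:
                changed += reset_error_status(value, old, new)
    elif isinstance(node, list):
        for item in node:
            changed += reset_error_status(item, old, new)
    return changed


def patch_experiment_state(experiment_dir, pattern="experiment*state*.json"):
    """Patch all matching state files in `experiment_dir` in place.

    The glob pattern is an assumption; adjust it to the file names
    you actually see in your experiment folder.
    """
    total = 0
    for path in sorted(Path(experiment_dir).expanduser().glob(pattern)):
        state = json.loads(path.read_text())
        changed = reset_error_status(state)
        if changed:
            path.write_text(json.dumps(state, indent=2))
        total += changed
    return total
```

Calling patch_experiment_state on the experiment folder returns how many status entries were changed, so a result of 0 means nothing was in the ERROR state (or the pattern did not match any file).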

Jun 04 '19 14:06 kapsl

Hey @kapsl, sorry for the delayed response. I tried replicating this issue but I'm able to resume trials on my end. Here's what I run:

softlearning launch_example_debug examples.development \
  --universe=gym \
  --domain=Swimmer \
  --task=v3 \
  --exp-name="checkpoint-test-1" \
  --checkpoint-frequency=1

The above command saves the trial data into ~/ray_results/gym/Swimmer/v3/2019-06-06T09-47-54-checkpoint-test-1/id=36f74991-seed=4411_2019-06-06_09-47-541swaqw7v/

softlearning launch_example_debug examples.development \
    --universe=gym \
    --domain=Swimmer \
    --task=v3 \
    --exp-name="checkpoint-test-1" \
    --num-samples=4 \
    --trial-cpus=2 \
    --trial-gpus=0.25 \
    --checkpoint-frequency=1 \
    --checkpoint-at-end=False \
    --video-save-frequency=25 \
    --with-server=False \
    --restore=~/ray_results/gym/Swimmer/v3/2019-06-06T09-47-54-checkpoint-test-1/id=36f74991-seed=4411_2019-06-06_09-47-541swaqw7v/checkpoint_1/

Could you try running a similar minimal example and see if that works? I'm happy to help you debug this if you provide more information or (preferably) a minimal example that reproduces the issue.

Jun 06 '19 16:06 hartikainen

Hi, for me that actually doesn't work. The problem starts when I pass a concrete checkpoint such as checkpoint_600 to --restore: it doesn't find any experiment, because it looks for the .json file.

I also recently ran into a strange situation: my code changes made restore work, but sometimes it just doesn't restore. It says it is restoring from the folder, with no error message, but then it just starts training from 0.

Jul 22 '19 11:07 kapsl

OK, I finally found the problem:

I reverted all my changes and, as you said, it works. The problem is that the exp-name of the trial and of the restore must match exactly!
When you start the experiment, it complains that it has not found a checkpoint file and is starting a new experiment. But that is not true: it is actually continuing the old one, just creating a new results folder...
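
To catch a mismatch before launching, the experiment name can be read back out of the restore path and compared with what is passed to --exp-name. This is a hypothetical helper, not part of softlearning or ray. It assumes the experiment folder is named "<timestamp>-<exp-name>", as in the paths quoted earlier in this thread.

```python
import re
from pathlib import PurePosixPath

# Assumed folder naming: "2019-06-06T09-47-54-<exp-name>"
_EXPERIMENT_DIR = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}-(?P<name>.+)")


def exp_name_from_restore_path(restore_path):
    """Return the exp-name embedded in `restore_path`, or None if not found."""
    for part in PurePosixPath(restore_path).parts:
        match = _EXPERIMENT_DIR.fullmatch(part)
        if match:
            return match.group("name")
    return None
```

If the returned name differs from the --exp-name you are about to pass, the restore is likely to hit the problem described above.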

Jul 23 '19 07:07 kapsl