Unable to continue training from checkpoint.
I am trying to run some more training loops for a specific region, using this notebook.
I was not happy with the clustering results, so I wanted to run a few epochs only on my target area.
When I do so, with
!python trainer.py fit --trainer.max_epochs=100 \
--data.data_dir=data/chips \
--ckpt_path=data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt
I get this error:
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Total number of chips: 1102
/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /home/brunosan/code/Clay/model/checkpoints exists and is not empty.
Restoring states from the checkpoint path at data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------
0 | model | CLAY | 127 M
-------------------------------
127 M Trainable params
0 Non-trainable params
127 M Total params
510.809 Total estimated model params size (MB)
Traceback (most recent call last):
File "/home/brunosan/code/Clay/model/trainer.py", line 77, in <module>
cli_main()
File "/home/brunosan/code/Clay/model/trainer.py", line 64, in cli_main
cli = LightningCLI(
^^^^^^^^^^^^^
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 386, in __init__
self._run_subcommand(self.subcommand)
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 677, in _run_subcommand
fn(**fn_kwargs)
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
self._checkpoint_connector.restore_training_state()
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 296, in restore_training_state
self.restore_optimizers_and_schedulers()
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 362, in restore_optimizers_and_schedulers
raise KeyError(
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`.'
Looks like we are indeed only saving the weights. Not sure if that means we cannot continue training at all, or if there is a workaround. @weiji14 and @srmsoumya ? https://github.com/Clay-foundation/model/blob/b8aa8cdce9fe56b93ca6e40cc0139414511ae79b/trainer.py#L50
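For reference, a quick way to confirm what the file actually contains is to inspect its top-level keys; this is a minimal sketch using the checkpoint path from the command above.

```python
# Minimal sketch: inspect what the checkpoint file contains.
# A checkpoint that can be resumed via --ckpt_path also carries the
# 'optimizer_states' and 'lr_schedulers' keys; a weights-only checkpoint
# mainly carries 'state_dict'.
import torch

ckpt = torch.load(
    "data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt", map_location="cpu"
)
print(sorted(ckpt.keys()))
```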
Yeah, we did not save the AdamW optimizer state, so it won't be possible to resume training from that checkpoint with AdamW or any other adaptive optimization algorithm. It might be possible to resume with a non-adaptive optimizer such as Stochastic Gradient Descent, but that would require a lot of manual handling of the checkpoint loading, so it is not a straightforward workaround.
That said, the original objective seems to be finetuning the checkpoint on a specific region rather than resuming the self-supervised training. The entrypoint shouldn't be trainer.py, but a separate finetuning script (which could technically still reuse elements of the MAE training loop).
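If restarting from the weights-only checkpoint is what's needed, a workaround is to load just the `state_dict` into the module and call `fit` without `ckpt_path`, so Lightning initializes a fresh optimizer instead of trying to restore one from the file. This is only a sketch; the `CLAYModule` / `ClayDataModule` names and constructor arguments are placeholders for whatever trainer.py actually defines.

```python
# Sketch (placeholder class names, not the exact trainer.py API):
# restart training from a weights-only checkpoint with freshly
# initialized optimizer state.
import torch
import lightning as L

from trainer import CLAYModule, ClayDataModule  # placeholder imports

# Load only the model weights; the file has no optimizer/scheduler state.
ckpt = torch.load(
    "data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt", map_location="cpu"
)
model = CLAYModule()  # constructor arguments omitted / hypothetical
model.load_state_dict(ckpt["state_dict"])

datamodule = ClayDataModule(data_dir="data/chips")  # e.g. the regional chips

# Do NOT pass ckpt_path here, otherwise Lightning tries (and fails) to
# restore optimizer state from the weights-only checkpoint.
trainer = L.Trainer(max_epochs=100, precision="bf16-mixed")
trainer.fit(model=model, datamodule=datamodule)
```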
The main use case is resuming training if it gets halted (e.g. we were using Spot instances), but I can also see cases where a regional user might want to continue training with regional data.
If we choose not to save the optimizer state, we should document how to resume training with newly initialized optimizers.
I agree we should have a way to resume training for the checkpoints we save (or at least the last one), if that is technically possible and won't slow down training too much.
We have addressed this for v0.2, and will also do so for v1, by storing the optimizer state during training. So I am closing this, but feel free to reopen if the issue persists in future versions of the model.
Not saving the optimizer remains the default. https://github.com/Clay-foundation/model/blob/50094baba3e0ec71a97d1abc7116e7b308d5986d/trainer.py#L51
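For future checkpoints, the relevant knob is the `save_weights_only` flag on the `ModelCheckpoint` callback; something along these lines (the monitored metric name is an assumption) makes the resulting files resumable via `--ckpt_path`:

```python
# Sketch: a ModelCheckpoint configuration that keeps the full training state
# (weights + optimizer + LR scheduler), so `fit --ckpt_path=...` can resume.
from lightning.pytorch.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints",
    monitor="val/loss",       # assumed metric name
    save_last=True,           # always keep the latest checkpoint for resuming
    save_weights_only=False,  # include optimizer/scheduler state in the file
)
```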
Addressed in PR #193