Unable to continue training from checkpoint.
I am trying to run some more training loops for a specific region, using this notebook.
I was not happy with the clustering results, so I wanted to run a few epochs only on my target area.
When I do so, with
!python trainer.py fit --trainer.max_epochs=100 \
--data.data_dir=data/chips \
--ckpt_path=data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt
I get this error:
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Total number of chips: 1102
/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /home/brunosan/code/Clay/model/checkpoints exists and is not empty.
Restoring states from the checkpoint path at data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------
0 | model | CLAY | 127 M
-------------------------------
127 M Trainable params
0 Non-trainable params
127 M Total params
510.809 Total estimated model params size (MB)
Traceback (most recent call last):
File "/home/brunosan/code/Clay/model/trainer.py", line 77, in <module>
cli_main()
File "/home/brunosan/code/Clay/model/trainer.py", line 64, in cli_main
cli = LightningCLI(
^^^^^^^^^^^^^
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 386, in __init__
self._run_subcommand(self.subcommand)
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 677, in _run_subcommand
fn(**fn_kwargs)
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
self._checkpoint_connector.restore_training_state()
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 296, in restore_training_state
self.restore_optimizers_and_schedulers()
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 362, in restore_optimizers_and_schedulers
raise KeyError(
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`.'
Looks like we are indeed only saving the weights. Not sure if that means we cannot continue training at all, or if there is a workaround. @weiji14 and @srmsoumya ? https://github.com/Clay-foundation/model/blob/b8aa8cdce9fe56b93ca6e40cc0139414511ae79b/trainer.py#L50
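For reference, a quick way to confirm what the file actually contains is to inspect its top-level keys; this is a minimal sketch using the checkpoint path from the command above.

```python
# Minimal sketch: inspect what the checkpoint file contains.
# A checkpoint that can be resumed via --ckpt_path also carries the
# 'optimizer_states' and 'lr_schedulers' keys; a weights-only checkpoint
# mainly carries 'state_dict'.
import torch

ckpt = torch.load(
    "data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt", map_location="cpu"
)
print(sorted(ckpt.keys()))
```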
Yeah, we did not save the AdamW optimizer state, so it won't be possible to resume training from that checkpoint with AdamW or any other adaptive optimization algorithm. It might be possible to resume with a non-adaptive optimizer such as Stochastic Gradient Descent, but that would require a lot of manual handling of the checkpoint loading, so it is not a straightforward workaround.
That said, the original objective seems to be finetuning the checkpoint on a specific region rather than resuming the self-supervised training. The entrypoint shouldn't be trainer.py, but a separate finetuning script (which could technically still reuse elements of the MAE training loop).
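If restarting from the weights-only checkpoint is what's needed, a workaround is to load just the `state_dict` into the module and call `fit` without `ckpt_path`, so Lightning initializes a fresh optimizer instead of trying to restore one from the file. This is only a sketch; the `CLAYModule` / `ClayDataModule` names and constructor arguments are placeholders for whatever trainer.py actually defines.

```python
# Sketch (placeholder class names, not the exact trainer.py API):
# restart training from a weights-only checkpoint with freshly
# initialized optimizer state.
import torch
import lightning as L

from trainer import CLAYModule, ClayDataModule  # placeholder imports

# Load only the model weights; the file has no optimizer/scheduler state.
ckpt = torch.load(
    "data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt", map_location="cpu"
)
model = CLAYModule()  # constructor arguments omitted / hypothetical
model.load_state_dict(ckpt["state_dict"])

datamodule = ClayDataModule(data_dir="data/chips")  # e.g. the regional chips

# Do NOT pass ckpt_path here, otherwise Lightning tries (and fails) to
# restore optimizer state from the weights-only checkpoint.
trainer = L.Trainer(max_epochs=100, precision="bf16-mixed")
trainer.fit(model=model, datamodule=datamodule)
```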
The main use case is resuming training if it gets halted (e.g. we were using Spot instances), but I can also see cases where a regional user might want to continue training with regional data.
If we choose not to save the optimizer state, we should document how to resume training with newly initialized optimizers.
I agree we should have a way to resume training for the checkpoints we save (or at least the last one), if that is technically possible and won't slow down training too much.
We have addressed this for v0.2, and will also do so for v1, by storing the optimizer state during training. So I am closing this, but feel free to reopen if the issue persists in future versions of the model.
Not saving the optimizer remains the default. https://github.com/Clay-foundation/model/blob/50094baba3e0ec71a97d1abc7116e7b308d5986d/trainer.py#L51
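For future checkpoints, the relevant knob is the `save_weights_only` flag on the `ModelCheckpoint` callback; something along these lines (the monitored metric name is an assumption) makes the resulting files resumable via `--ckpt_path`:

```python
# Sketch: a ModelCheckpoint configuration that keeps the full training state
# (weights + optimizer + LR scheduler), so `fit --ckpt_path=...` can resume.
from lightning.pytorch.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints",
    monitor="val/loss",       # assumed metric name
    save_last=True,           # always keep the latest checkpoint for resuming
    save_weights_only=False,  # include optimizer/scheduler state in the file
)
```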
Addressed in PR #193