🐛 Describe the bug
When I add the --resume parameter to load a last.ckpt file (Stable Diffusion), training fails with the following error:
/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py:277: UserWarning: Could not log computational graph to TensorBoard: The model.example_input_array attribute is not set or input_array was not given.
rank_zero_warn(
'ZeroOptimizer' object has no attribute 'state'
Traceback (most recent call last):
File "/data/stablediffusionv2_finetune/main.py", line 920, in
trainer.fit(model, data)
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 602, in fit
call._call_and_handle_interrupt(
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
return function(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 644, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in _run
self._checkpoint_connector.restore_training_state()
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 286, in restore_training_state
self.restore_optimizers_and_schedulers()
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 382, in restore_optimizers_and_schedulers
self.restore_optimizers()
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 397, in restore_optimizers
self.trainer.strategy.load_optimizer_state_dict(self._loaded_checkpoint)
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 369, in load_optimizer_state_dict
_optimizer_to_device(optimizer, self.root_device)
File "/home/ubuntu/anaconda3/envs/ldmco/lib/python3.9/site-packages/lightning_lite/utilities/optimizer.py", line 33, in _optimizer_to_device
for p, v in optimizer.state.items():
AttributeError: 'ZeroOptimizer' object has no attribute 'state'
Environment
No response
Were you using the ZeRO optimizer in your last run? The ZeroOptimizer class does not have a state attribute (it has optim_state), so there is a mismatch with what the checkpoint-restore code expects.
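For anyone hitting this before a proper fix lands, here is a minimal sketch of a defensive accessor. The optim_state fallback is an assumption taken from the reply above, and _get_optimizer_state is a hypothetical helper, not Lightning or ColossalAI API; verify the attribute name against your installed version.

```python
def _get_optimizer_state(optimizer):
    # Plain torch.optim optimizers keep per-parameter state in `.state`.
    # ColossalAI's ZeroOptimizer reportedly keeps it in `.optim_state`
    # instead (assumption from this thread; check your installed version).
    if hasattr(optimizer, "state"):
        return optimizer.state
    if hasattr(optimizer, "optim_state"):
        return optimizer.optim_state
    raise AttributeError(f"{type(optimizer).__name__} exposes no optimizer state")

# Hypothetical patch site: the failing loop in
# lightning_lite/utilities/optimizer.py::_optimizer_to_device would become
#     for p, v in _get_optimizer_state(optimizer).items():
#         ...
```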
JThh · Jan 26 '23 12:01
Hi @JThh, I'm hitting the same issue. Would you give me an example of saving and loading a ZeroOptimizer? Thanks
Hi @JThh,
Quick question:
In the code, the optimizer and model are loaded from the saved checkpoint only when local_rank == 0. If the model was trained with distributed training, I think all processes should load the saved checkpoint. Is that correct? Thanks
Based on your second question, it sounds like you've already found our checkpoint saving and loading utilities.
This line gathers the tensors and eliminates inter-device differences before we save at rank 0. And this line broadcasts the model weights across devices after rank 0 loads them. The same goes for the optimizers.
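To illustrate the rank-0-save / broadcast-on-load pattern described above, here is a minimal torch.distributed sketch. It is not the library's actual utility: the function names and the plain state_dict save are simplifying assumptions, and real ZeRO checkpointing must also gather sharded states first.

```python
import torch
import torch.distributed as dist

def save_checkpoint_rank0(model, path):
    # Only rank 0 writes the file; the barrier keeps other ranks from
    # racing ahead (e.g., trying to read the file before it exists).
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()

def load_checkpoint_and_broadcast(model, path):
    # Rank 0 loads the weights from disk; every rank then participates
    # in a broadcast so all processes end up with identical parameters.
    if dist.get_rank() == 0:
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state)
    for param in model.parameters():
        dist.broadcast(param.data, src=0)
```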
Hope this helps!
JThh · Apr 18 '23 08:04