Optimizer state isn't preserved across runs
I often have to restart a run, whether to fix something in my reward function or to recover from an OOM or a crash that broke training. When I do, restarting the training process throws away the optimizer state. I'm worried that this might lead to worse performance than just letting a run go all the way through. Is it easy to save the optimizer state along with the weights so we can truly resume as if nothing happened?
It should be fairly straightforward to implement; the main cost is that the optimizer state probably takes up more space than the LoRA adapters themselves (with an Adam-style optimizer it's two extra moment tensors per parameter, so roughly twice the adapter size). I'm not sure whether saving the optimizer state should be the default behavior or not. 🤔
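For reference, this is roughly what it looks like with a plain PyTorch optimizer. Just a sketch, and the checkpoint directory, file name, and helper functions are made up rather than anything the library exposes:

```python
import os
import torch

# --- at checkpoint time: save the optimizer state next to the adapter weights ---
def save_optimizer_state(optimizer, ckpt_dir):
    torch.save(optimizer.state_dict(), os.path.join(ckpt_dir, "optimizer.pt"))

# --- on resume: rebuild the optimizer as usual, then restore its state ---
def load_optimizer_state(optimizer, ckpt_dir):
    path = os.path.join(ckpt_dir, "optimizer.pt")
    if os.path.exists(path):
        optimizer.load_state_dict(torch.load(path, map_location="cpu"))
        return True
    return False  # no saved state found: start the optimizer fresh
```

To truly resume "as if nothing happened" the LR scheduler state would presumably need the same treatment.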
(Not sure about this) maybe a reasonable default would be to save a single optimizer state for the latest checkpoint only, since in the common case that's the one you resume from. If for whatever reason you want to resume from an older checkpoint, you don't get to use the optimizer state, but oh well.
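If that's the route, a sketch of the "latest only" behavior (again, all paths and names hypothetical) could be as simple as overwriting one file and only restoring it when the step being resumed actually matches:

```python
import os
import torch

LATEST_OPT_PATH = "checkpoints/optimizer_latest.pt"  # one file, overwritten at every save

def save_latest_optimizer(optimizer, step, path=LATEST_OPT_PATH):
    # Keep only the newest optimizer state by overwriting the same file.
    torch.save({"step": step, "optimizer": optimizer.state_dict()}, path)

def maybe_resume_optimizer(optimizer, resume_step, path=LATEST_OPT_PATH):
    # Restore only if the saved state belongs to the checkpoint being resumed.
    if os.path.exists(path):
        saved = torch.load(path, map_location="cpu")
        if saved["step"] == resume_step:
            optimizer.load_state_dict(saved["optimizer"])
            return True
    return False  # older checkpoint: fall back to a fresh optimizer
```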