
Optimizer state isn't preserved across runs

Open corbt opened this issue 9 months ago • 2 comments

I often have to restart a run, whether to fix something in my reward function or in response to an OOM or crash that broke training. When I do, restarting the training process throws away the optimizer state. I'm worried that this might lead to worse performance than just letting a run go all the way through. Is it easy to save the optimizer state along with the weights so we can truly resume as if nothing happened?

corbt avatar Apr 21 '25 18:04 corbt
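For reference, here is a minimal sketch of what "save the optimizer state along with the weights" could look like in plain PyTorch. None of this is ART's actual API; `model` stands in for a PEFT/LoRA-wrapped model, `optimizer` for the optimizer driving training, and the checkpoint layout is purely illustrative.

```python
import torch

def save_checkpoint(model, optimizer, step, ckpt_dir):
    # The adapter weights are saved however the trainer already saves them;
    # the optimizer state dict is the extra piece this issue asks about.
    model.save_pretrained(f"{ckpt_dir}/step-{step}")
    torch.save(
        {"step": step, "optimizer": optimizer.state_dict()},
        f"{ckpt_dir}/step-{step}/optimizer.pt",
    )

def load_checkpoint(optimizer, step, ckpt_dir):
    # Restoring the moments and step counts lets training resume as if
    # the process had never been killed.
    state = torch.load(f"{ckpt_dir}/step-{step}/optimizer.pt", map_location="cpu")
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```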

It should be fairly straightforward to implement; the optimizer state just probably takes up more space than the LoRA adapters themselves. I'm uncertain whether saving the optimizer state should be the default behavior or not. 🤔

bradhilton avatar Apr 21 '25 22:04 bradhilton
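To put a rough number on the space concern: AdamW keeps two fp32 moment tensors (`exp_avg`, `exp_avg_sq`) per trainable parameter, i.e. about 8 bytes per parameter, so the optimizer state is on the order of 2-4x the size of an adapter saved in fp32 or bf16. A back-of-the-envelope helper (the `model` argument is illustrative, not ART's API):

```python
def adamw_state_bytes(model):
    # AdamW stores two fp32 moment tensors per trainable parameter,
    # so roughly 8 bytes per trainable parameter of extra state.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable * 8
```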

(Not sure about this) maybe a reasonable default would be to save a single optimizer state for the latest checkpoint, since in the common case you're resuming from that one. If for whatever reason you want to resume from an older checkpoint, you don't get to use the optimizer state, but oh well.

corbt avatar Apr 21 '25 23:04 corbt
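One way to read that suggestion as code: keep a single `optimizer.pt` that always tracks the newest checkpoint, record which step it belongs to, and quietly skip it when resuming from anywhere else. A hypothetical sketch, with paths and helper names made up for illustration:

```python
import os
import torch

def save_latest_optimizer_state(optimizer, step, ckpt_dir):
    # Write to a temp file and then os.replace(), so a crash mid-save
    # can't leave a truncated optimizer.pt behind.
    tmp_path = os.path.join(ckpt_dir, "optimizer.pt.tmp")
    final_path = os.path.join(ckpt_dir, "optimizer.pt")
    torch.save({"step": step, "optimizer": optimizer.state_dict()}, tmp_path)
    os.replace(tmp_path, final_path)

def try_resume_optimizer(optimizer, resume_step, ckpt_dir):
    # Only reuse the saved state if it matches the checkpoint being
    # resumed; resuming an older checkpoint falls back to a fresh optimizer.
    path = os.path.join(ckpt_dir, "optimizer.pt")
    if os.path.exists(path):
        state = torch.load(path, map_location="cpu")
        if state["step"] == resume_step:
            optimizer.load_state_dict(state["optimizer"])
            return True
    return False
```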