
Adjustments to resuming training

Open Cauldrath opened this issue 1 year ago • 3 comments

Currently, resuming skips the resumed epoch if the saved state was partway through it, and calculates the global step count as only the number of steps into the current epoch.

These changes make it resume mid epoch on the appropriate step, with the right global step count, so max steps will be honored.

This also includes a change to just do a multiplication instead of a for loop over every elapsed epoch.
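A rough sketch of the idea, not the actual code in this change: the epoch and the in-epoch offset can be recovered from the saved global step with one division instead of a loop over elapsed epochs, and the global step counter then continues from the saved value so the step limit is still checked correctly. The function name `simulate_resume`, its parameters, and the numbers in the usage note are hypothetical and chosen only for illustration.

```python
def simulate_resume(
    initial_step: int, steps_per_epoch: int, max_train_steps: int
) -> tuple[int, list[tuple[int, int]]]:
    """Simulate resuming at `initial_step` and return the final global step
    plus the (epoch, step_in_epoch) pairs that would be trained.

    Hypothetical helper for illustration; it does not call sd-scripts.
    """
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")

    # One division replaces looping over every elapsed epoch:
    # epoch = completed epochs, skip = steps already done in the current epoch.
    epoch, skip = divmod(initial_step, steps_per_epoch)

    # The global step continues from the saved value, not from the offset
    # into the current epoch, so the max step limit is honored.
    global_step = initial_step
    visited: list[tuple[int, int]] = []

    while global_step < max_train_steps:
        # Resume mid epoch: start at `skip` instead of skipping the epoch.
        for step in range(skip, steps_per_epoch):
            visited.append((epoch, step))
            global_step += 1
            if global_step >= max_train_steps:
                break
        skip = 0  # only the first resumed epoch is partial
        epoch += 1

    return global_step, visited
```

For example, with an assumed 500 steps per epoch, `simulate_resume(1200, 500, 1600)` starts at epoch 2, step 200, and stops once the global step reaches 1600.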

Cauldrath, Jul 01 '24 03:07

This fixed the problems I was having. Resuming from 200, 400, up to 1000 steps (when it output "epoch is incremented. current_epoch: 0, epoch: 1") worked as intended. Resuming from 1200 steps onwards (when it output "epoch is incremented. current_epoch: 0, epoch: 2"), however, let training continue past the maximum number of steps, and it did not even save a model when it reached the max steps.

slashedstar, Jul 01 '24 03:07

Thank you for this! Sorry, I didn't test with the --max_train_steps option. As I understand it, this fixes the issue that occurs when --max_train_steps is specified.

kohya-ss, Jul 08 '24 11:07

Yes, --max_train_steps combined with resuming or setting --initial_steps is the main problem when training does not start in the first epoch.

Cauldrath, Jul 13 '24 03:07