OneTrainer icon indicating copy to clipboard operation
OneTrainer copied to clipboard

added Skip Save interval before automatic saving starts

Open dxqb opened this issue 1 year ago • 6 comments

Automatic saving is used to have multiple checkpoints after training, so you can choose the best one. However, you know in advance that the first x000 steps will not be the best one. This PR adds a function to skip saving in those first x000 steps.

This example saves every 500 steps, but not within the first 3000:

grafik

Especially useful for full finetunes with large file sizes. The first few checkpoints waste both disk space and the time it takes to save.

dxqb avatar Sep 29 '24 21:09 dxqb

By the way, the GUI has become very confusing now.

  • Save After: 500 STEP
  • Skip Save: 3000 STEP

How about this?

  • Save Every: 500 STEP
  • Skip Early Steps: 3000 STEP

Something like that? Or perhaps there's an even better way to label this?

They could be visually relabeled without changing the internal save_after setting's name for backwards compatibility reasons. Or perhaps that can be changed too. Nero will be a much better judge of that.

Arcitec avatar Sep 29 '24 21:09 Arcitec

Do you think there is a use case for using different units for save_after_unit and skip_save_unit? If not, what do you think about adding only a new value so we have save_after, save_after_skip, save_after_unit. That could be displayed in a single component with these three inputs and maybe a better label. leaving the save_after_skip input empty would disable skipping and always save.

They could be visually relabeled without changing the internal save_after setting's name for backwards compatibility reasons

changing the name of the setting would be possible by adding a migration to TrainConfig

I think it would also make sense to add the same functionality for backups.

Nerogar avatar Sep 29 '24 22:09 Nerogar

Yeah, having only a single unit selector dropdown seems to make sense. I can't think of any scenario where we'd want to use "Save Every 500 Steps, But Skip Early Steps: 1 hour". That's a very confusing way of thinking about it. It makes more sense to use the same unit for both things.

Adding a setting-migration to use the clearer label save_every is a good idea too.

And yeah, having this functionality for backups would be a nice bonus.

Arcitec avatar Sep 29 '24 22:09 Arcitec

Do you think there is a use case for using different units for save_after_unit and skip_save_unit?

No, but after seeing that the number entry and unit is one combined UI widget I went for this to keep the PR simple. If you all agree that this PR will be merged, I'll look into a better UI to have only one unit.

I think it would also make sense to add the same functionality for backups.

I think rolling backups are already that functionality best suited for backups. You want to keep all saves except the early ones (skip_save), but you want the early backups until they are outdated (rolling backups)

dxqb avatar Sep 29 '24 22:09 dxqb

If you all agree that this PR will be merged

Yeah it's a great idea! Let's continue it. :)

As for displaying it in a single component/label row, it might be hard to make that clear for the users. But maybe something like this:

Save Every / Skip Early:   [  500 ]    [    0 ]    [  STEP      ]

But that may not fit the label width of the other rows, and may still not be clear enough.

So maybe two rows:

Save Every:   [  500 ]    [  STEP      ]
Skip Early:   [    0 ]

Arcitec avatar Sep 29 '24 22:09 Arcitec

this is a very welcomed addition. Before this I was thinking about an option to specify multiple exact numbers of steps/epochs, separated by a comma, to be saved. But also this is a good enough option. Hope it will be implemented soon. Thanks.

djp3k05 avatar Oct 04 '24 09:10 djp3k05

grafik

Internal parameter names adjusted and config migration added

For this I needed to add a parameter to util.ui.components.entry and noticed a lot of code duplication there. Carefully checked: can be removed without changing anything

Further noticed that the datatype of save_every is float, but the default setting is an int. Did not change that, because I'm not sure this wasn't intentional and might have some purpose, so I duplicated that for save_skip_first

dxqb avatar Oct 16 '24 17:10 dxqb

... Further noticed that the datatype of save_every is float, but the default setting is an int. Did not change that, because I'm not sure this wasn't intentional and might have some purpose, so I duplicated that for save_skip_first

Just as an aside I can think of instances where the user will want the first few epochs then after that none for awhile, this is particularly useful for finetunes like my group did. Needed to watch and test the model carefully for the first 10 epochs but after that it was just slow n steady up until we reached 40 epochs.

O-J1 avatar Oct 17 '24 09:10 O-J1

Further noticed that the datatype of save_every is float, but the default setting is an int. Did not change that, because I'm not sure this wasn't intentional and might have some purpose, so I duplicated that for save_skip_first

No, it's not intentional. And since this is only a type hint that's not actually really used apart from some linting, I've changed it.

There's also a small issue with the implementation of single_action_elapsed(). When resuming from a backup, the self.__start_time will be reset. Which means that saving every 30min, but skipping the first 60min will also skip the first 60min after resuming from a backup. But I guess that's an edge case that we can ignore for now.

Nerogar avatar Oct 20 '24 20:10 Nerogar

Excellent work @dxqbYD! :)

When resuming from a backup, the self.__start_time will be reset. Which means that saving every 30min, but skipping the first 60min will also skip the first 60min after resuming from a backup. But I guess that's an edge case that we can ignore for now.

If it's just a timestamp, it's doable by storing the elapsed time in the backup and then creating a fake start_time based on that when resuming. Generating a date object that's "now() minus X seconds ago" to restore the backed up "elapsed time".

Should we make a ticket to track the bug, or let it be? It's not a big bug.

Arcitec avatar Oct 22 '24 00:10 Arcitec