README train.py command doesn't save any output once training is complete
The README instructs you to use 601 steps: --total_steps 601
https://github.com/microsoft/KBLaM/blob/main/README.md?plain=1#L58
However, train.py hard-codes a save period of 3000 steps: save_period=3000
https://github.com/microsoft/KBLaM/blob/main/experiments/train.py#L955
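For context, a checkpoint condition like this can never fire when total_steps is smaller than save_period. A simplified sketch of the logic (illustrative only, not the actual code in train.py):

```python
# Hypothetical simplification of a save-every-N-steps training loop.
total_steps = 601     # value suggested by the README
save_period = 3000    # hard-coded save interval

for step in range(1, total_steps + 1):
    # ... one gradient-descent step ...
    if step % save_period == 0:   # never true while step <= 601
        print(f"saving checkpoint at step {step}")
# The loop finishes without ever saving, so the output folder stays empty.
```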
After reaching 100% training, the output folder was empty:
Training ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% Loss: 9.2729 0:00:002 0:00:2704:29
Took about 4 hours to train, but then it didn't save the result 😅
So should the suggested number of steps in the README be increased, or should the hard-coded save_period value be lowered?
I ran into the same problem last week but didn't have time to post this issue.
Hard-coding the save period is a bad idea. I think it would be better to make it configurable via a command-line option such as --save_interval 1000. Best of all, if the option is not given, it should default to a value tied to total_steps; with that default the model is always saved at the end of training.
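A minimal sketch of that idea, assuming an argparse-based CLI (the --save_interval flag and its fallback are only a proposal, not existing options in train.py):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--total_steps", type=int, default=601)
# Proposed option; defaults to None so it can fall back to total_steps.
parser.add_argument("--save_interval", type=int, default=None)
args = parser.parse_args()

# If no interval is given, save exactly once, at the end of training.
save_interval = args.save_interval or args.total_steps

for step in range(1, args.total_steps + 1):
    # ... one training step ...
    if step % save_interval == 0:
        print(f"saving checkpoint at step {step}")
```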
The notion of 'steps' is not really understandable. What does it mean? Is it related to the notion of 'epochs', or are they unrelated? It would be better to use the notion of 'epochs' common in ML algorithms, which is understandable as a complete pass through all training examples.
@ThomasHoppe thanks for pointing out the save_interval issue; I am copying @ti250 to fix this.
For the notion of a step: it means one gradient-descent step. Note that we intentionally adopt the notion of "number of steps" rather than number of epochs because we do not necessarily need a complete pass over all training examples. But you can think of one epoch as TrainingDataSize / BatchSize steps.
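As a rough illustration of that relationship (all numbers here are made up, not KBLaM's actual dataset or batch size):

```python
# Illustrative only: converting between epochs and gradient-descent steps.
training_data_size = 120_000   # hypothetical number of training examples
batch_size = 20                # hypothetical batch size

steps_per_epoch = training_data_size // batch_size   # 6000 steps ≈ 1 epoch
total_steps = 601
epochs_covered = total_steps / steps_per_epoch        # ≈ 0.1 epochs
print(f"{total_steps} steps ≈ {epochs_covered:.2f} epochs")
```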
Hmm, I think "steps of gradient decent" is a quite strange parameter, since it depends on the algorithm and is not directly predictabe by users of the software, especially if the number of steps varies between trainings data.
If you just want to make it independent of a complete pass over the training set, I think the "number of training cases used" would be easier to understand and more predictable; see the sketch below.
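Deriving the number of steps from such an option would be trivial; a hypothetical sketch (the option name and numbers are made up, not part of train.py):

```python
# Hypothetical: let users specify how many training examples to consume,
# then derive the number of gradient-descent steps from the batch size.
num_training_examples = 12_000   # e.g. from a --num_training_examples option
batch_size = 20

total_steps = num_training_examples // batch_size
print(f"{num_training_examples} examples at batch size {batch_size} "
      f"-> {total_steps} steps")
```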
How many steps are required to achieve KBLaM's findings? Am I correct in understanding the whitepaper used 20,000 steps?
I'm trying to run eval on 400 steps and am wondering if I'm several hundred/thousand steps short of functioning predictions, or if I'm pretty close with just 400 for a proof of concept?
@grctest IIRC 400-800 steps should be enough for the model to produce some reasonable (human-readable) output
I ran two experiments with train.py: one with several thousand steps on an H100 and another with just a few hundred steps on a 64GB Orin. Under the same seed, the training loss showed the same behavior; of course, it eventually ended up worse for the Orin run. I think an evaluation even with a smaller number of steps will give a reasonable approximation.
Sorry, I cannot give you the exact number of steps, since I only have access to both systems at my office.
I think that since learning-rate decay is used, the loss curve will eventually flatten regardless of the total number of steps.
But in my experience it won't hurt to train longer; I don't remember observing any severe overfitting.
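A sketch of why the decay flattens the curve, using a simple linear schedule (illustrative only, not necessarily the exact schedule train.py uses):

```python
# Illustrative only: a linear learning-rate decay over total_steps.
base_lr = 1e-3
total_steps = 601

def lr_at(step: int) -> float:
    # Learning rate shrinks linearly from base_lr towards 0 over training.
    return base_lr * max(0.0, 1.0 - step / total_steps)

for step in (1, 150, 300, 450, 600):
    print(f"step {step:4d}: lr = {lr_at(step):.2e}")
# Towards the end the updates become tiny, so the loss curve flattens
# regardless of how many total steps were chosen.
```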
Thanks Xi for tagging me, I'll open a PR