
Readme train.py command doesn't create saved output once training is complete

Open grctest opened this issue 7 months ago • 8 comments

The readme instructs to use 601 steps: --total_steps 601

https://github.com/microsoft/KBLaM/blob/main/README.md?plain=1#L58

However the train.py file has a static value of 3000 steps for the save period: save_period=3000

https://github.com/microsoft/KBLaM/blob/main/experiments/train.py#L955

After reaching 100% training, the output folder was empty:

  Training ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% Loss: 9.2729

Took about 4 hours to train, but then it didn't save the result 😅

So should the suggested steps in the readme be increased, or should the static save_period value be lowered?
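
For illustration, here is a minimal sketch of the suspected mismatch, assuming train.py uses the common "checkpoint every N steps" pattern (step % save_period == 0). The function below is hypothetical, not the actual train.py code:

```python
# Sketch: with save_period hard-coded to 3000, a 601-step run never
# reaches a step that is a multiple of the save period, so no
# checkpoint is ever written.
def save_steps(total_steps: int, save_period: int) -> list:
    """Return the (1-based) steps at which a checkpoint would be written."""
    return [s for s in range(1, total_steps + 1) if s % save_period == 0]

print(save_steps(601, 3000))   # -> [] (no save ever triggers)
print(save_steps(6000, 3000))  # -> [3000, 6000]
```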

grctest avatar Jun 24 '25 15:06 grctest

I ran into the same thing last week, but didn't have time to report the issue.

Hard-coding the save period is a bad idea. It would be better to make it configurable via a command-line option such as --save_interval 1000. Best of all, if the option is not given, it should default to a value equal to total_steps. With that default, the model is always saved at the end of training.
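
Something along these lines (a sketch of the suggestion only; the --save_interval flag and the default-to-total_steps behavior do not exist in train.py today):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--total_steps", type=int, default=601)
parser.add_argument(
    "--save_interval", type=int, default=None,
    help="Checkpoint every N steps; defaults to total_steps so the "
         "final model is always saved.",
)
args = parser.parse_args([])  # empty arg list, just for illustration

# Fall back to total_steps so a checkpoint is guaranteed at the end.
save_interval = args.save_interval or args.total_steps
print(save_interval)  # -> 601
```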

The notion of 'steps' is also not really clear. What does a step mean? Is it related to the notion of 'epochs', or are they unrelated? It would be better to use the notion of 'epochs' common in ML algorithms, which is easy to understand as one complete pass through all training examples.

ThomasHoppe avatar Jun 24 '25 16:06 ThomasHoppe

@ThomasHoppe thanks for pointing out the save_interval issue, I am copying @ti250 to fix this.

For the notion of step: it means one gradient descent step. Note that we intentionally use "number of steps" rather than number of epochs, because we do not necessarily need a complete pass over all training examples. But you can think of one epoch as TrainingDataSize / BatchSize steps.
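
In other words, roughly (a sketch of the relationship only, not project code; the dataset size and batch size below are made-up numbers):

```python
import math

def steps_per_epoch(dataset_size: int, batch_size: int) -> int:
    """One epoch = one full pass = ceil(dataset_size / batch_size) gradient steps."""
    return math.ceil(dataset_size / batch_size)

def steps_for_epochs(epochs: float, dataset_size: int, batch_size: int) -> int:
    """Convert a (possibly fractional) epoch count into a step budget."""
    return math.ceil(epochs * steps_per_epoch(dataset_size, batch_size))

# e.g. a hypothetical 120,000 training examples with batch size 20:
print(steps_per_epoch(120_000, 20))        # -> 6000 steps per epoch
print(steps_for_epochs(0.1, 120_000, 20))  # -> 600 steps for a tenth of an epoch
```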

xidulu avatar Jun 24 '25 16:06 xidulu

> For the notion of step: it means one gradient descent step. Note that we intentionally use "number of steps" rather than number of epochs, because we do not necessarily need a complete pass over all training examples. But you can think of one epoch as TrainingDataSize / BatchSize steps.

Hmm, I think "steps of gradient descent" is quite a strange parameter, since it depends on the algorithm and is not directly predictable by users of the software, especially if the number of steps varies with the training data.

If you just want to make it independent of a complete pass over the training set, I think the "number of training cases used" would be easier to understand and more predictable.
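
The conversion between the two views is straightforward either way (a sketch assuming a fixed batch size; these helpers are illustrative, not part of KBLaM):

```python
def examples_seen(steps: int, batch_size: int) -> int:
    # Each gradient step consumes exactly one batch of training cases.
    return steps * batch_size

def steps_needed(num_examples: int, batch_size: int) -> int:
    # Ceiling division, so the requested number of examples is fully covered.
    return -(-num_examples // batch_size)

print(examples_seen(601, 20))    # -> 12020 training cases for the README's 601 steps
print(steps_needed(12_000, 20))  # -> 600 steps to see 12,000 training cases
```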

ThomasHoppe avatar Jun 25 '25 12:06 ThomasHoppe

How many steps are required to achieve KBLaM's findings? Am I correct in understanding the whitepaper used 20,000 steps?

I'm trying to run eval on 400 steps and am wondering if I'm several hundred/thousand steps short of functioning predictions, or if I'm pretty close with just 400 for a proof of concept?

grctest avatar Jun 26 '25 10:06 grctest

@grctest IIRC 400 - 800 steps should be enough for the model to produce some reasonable (human-readable) output

xidulu avatar Jun 26 '25 14:06 xidulu

> How many steps are required to achieve KBLaM's findings? Am I correct in understanding the whitepaper used 20,000 steps?
>
> I'm trying to run eval on 400 steps and am wondering if I'm several hundred/thousand steps short of functioning predictions, or if I'm pretty close with just 400 for a proof of concept?

I ran two experiments with train.py: one with several thousand steps on an H100, and one with just a few hundred steps on a 64GB Orin. With the same seed, the training loss showed the same behavior; eventually, of course, it was worse for the Orin run. I think an evaluation even with a smaller number of steps will give a reasonable approximation.

Sorry, I cannot give you the exact number of steps, since both systems are only available in my office.

ThomasHoppe avatar Jun 26 '25 15:06 ThomasHoppe

I think that since learning rate decay is used, the loss curve will eventually flatten regardless of the total number of steps.

But in my experience it won't hurt to train longer; I don't remember observing any severe overfitting.
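
For intuition, a sketch of why decay flattens the curve whatever the step budget is. This uses a generic linear-decay schedule with made-up rates, not KBLaM's actual scheduler:

```python
def linear_decay_lr(step: int, total_steps: int,
                    base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Linearly anneal the learning rate from base_lr down to min_lr over total_steps."""
    frac = min(step / total_steps, 1.0)
    return base_lr + (min_lr - base_lr) * frac

# Near the end of any run the learning rate (hence the update size) shrinks
# toward min_lr, so the loss curve flattens whether total_steps is 601 or 20000.
for total in (601, 20_000):
    print(total, linear_decay_lr(total - 1, total))
```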

xidulu avatar Jun 26 '25 20:06 xidulu

Thanks Xi for tagging me, I'll open a PR

ti250 avatar Jul 29 '25 13:07 ti250