
Reference-model is slow on long sequences, especially with TP>1

Open RaymondLi0 opened this issue 6 months ago • 2 comments

Here are some latency numbers, measured at https://github.com/ServiceNow/Fast-LLM/tree/874cb2a875a439cff10d18a67b293ed59831ce4e by timing the reference model's forward pass: https://github.com/ServiceNow/Fast-LLM/blob/874cb2a875a439cff10d18a67b293ed59831ce4e/fast_llm/models/gpt/model.py#L336-L342.

With TP=2, mbs=1, the reference-model forward takes much longer than with TP=1. Another puzzling point: the reference-model forward time is larger with TP=2, mbs=1 than with TP=2, mbs=2. Could this be an issue with how the ref-model forward time is measured?

| Seq-length | TP | MBS | BS | Sequential micro-batches | Teacher size | Student size | Ref-model forward (ms) | Step time (ms) |
|-----------|----|-----|----|--------------------------|--------------|--------------|------------------------|----------------|
| 2048      | 1  | 1   | 16 | 1                        | 4.6B         | 4.6B         | 24                     | 206            |
| 4096      | 1  | 1   | 16 | 1                        | 4.6B         | 4.6B         | 46                     | 333            |
| 8192      | 1  | 1   | 16 | 1                        | 4.6B         | 4.6B         | 95                     | 656            |
| 2048      | 2  | 1   | 8  | 1                        | 4.6B         | 4.6B         | 80                     | 230            |
| 4096      | 2  | 1   | 8  | 1                        | 4.6B         | 4.6B         | 150                    | 367            |
| 8192      | 2  | 1   | 8  | 1                        | 4.6B         | 4.6B         | 301                    | 655            |
| 2048      | 2  | 2   | 16 | 1                        | 4.6B         | 4.6B         | 33                     | 231            |
| 4096      | 2  | 2   | 16 | 1                        | 4.6B         | 4.6B         | 59                     | 376            |
| 8192      | 2  | 2   | 16 | 1                        | 4.6B         | 4.6B         | 121                    | 739            |

RaymondLi0 avatar Aug 14 '25 16:08 RaymondLi0

Hi, I need more information to look into this. Is this with #347? Does the same problem happen in main? What is the config being run?

The debug logs are missing CUDA synchronization, which could be part of the problem.
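To illustrate the synchronization concern: CUDA kernels are launched asynchronously, so a host-side timer that does not synchronize can attribute queued work to whichever later operation happens to block, which could produce counter-intuitive numbers like those above. A minimal sketch of a synchronized measurement, assuming a generic PyTorch model (`timed_forward` is a hypothetical helper, not Fast-LLM code):

```python
import time

import torch


def timed_forward(model, batch):
    """Time one forward pass, synchronizing around the measured region.

    Without the synchronize() calls, time.perf_counter() would mostly
    measure host-side kernel-launch overhead, and the GPU work would be
    billed to whatever later op blocks on the stream.
    """
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain previously queued work first
    start = time.perf_counter()
    with torch.no_grad():
        output = model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for this forward's kernels to finish
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return output, elapsed_ms
```

On GPU, `torch.cuda.Event(enable_timing=True)` pairs are a lower-overhead alternative that avoid stalling the host between iterations.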

jlamypoirier avatar Aug 15 '25 16:08 jlamypoirier

Yes, this is with #347. Main does not currently support running tensor-parallel distillation. I will check with CUDA synchronization added.

RaymondLi0 avatar Aug 18 '25 15:08 RaymondLi0