Reference-model is slow on long sequences, especially with TP>1
Here are some latency numbers, measured at https://github.com/ServiceNow/Fast-LLM/tree/874cb2a875a439cff10d18a67b293ed59831ce4e by timing the reference model's forward pass (https://github.com/ServiceNow/Fast-LLM/blob/874cb2a875a439cff10d18a67b293ed59831ce4e/fast_llm/models/gpt/model.py#L336-L342).
With TP=2 and MBS=1, the reference-model forward takes much longer than with TP=1. Another puzzling point is that the reference-model forward time is larger with TP=2, MBS=1 than with TP=2, MBS=2. Could this be an issue with how the forward time is measured here? (See the timing sketch after the table.)
| Seq-length | TP | MBS | BS | Sequential micro-batches | Teacher size | Student size | Ref-model forward (ms) | Step time (ms) |
|---|---|---|---|---|---|---|---|---|
| 2048 | 1 | 1 | 16 | 1 | 4.6B | 4.6B | 24 | 206 |
| 4096 | 1 | 1 | 16 | 1 | 4.6B | 4.6B | 46 | 333 |
| 8192 | 1 | 1 | 16 | 1 | 4.6B | 4.6B | 95 | 656 |
| 2048 | 2 | 1 | 8 | 1 | 4.6B | 4.6B | 80 | 230 |
| 4096 | 2 | 1 | 8 | 1 | 4.6B | 4.6B | 150 | 367 |
| 8192 | 2 | 1 | 8 | 1 | 4.6B | 4.6B | 301 | 655 |
| 2048 | 2 | 2 | 16 | 1 | 4.6B | 4.6B | 33 | 231 |
| 4096 | 2 | 2 | 16 | 1 | 4.6B | 4.6B | 59 | 376 |
| 8192 | 2 | 2 | 16 | 1 | 4.6B | 4.6B | 121 | 739 |
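For context on the measurement question above, here is a minimal, self-contained sketch (not Fast-LLM code, and assuming the forward timer is a plain host-side wall clock around the call): because CUDA kernels are launched asynchronously, an un-synchronized timer around one section can absorb work queued by earlier sections, since the first operation that blocks on the GPU pays for everything still in flight. If that is what is happening here, the reference-model number could plausibly be inflated by work left over from the student step.

```python
# Toy demonstration (not Fast-LLM code): asynchronous CUDA launches make naive
# host-side timing of a single section misleading.
import time

import torch

x = torch.randn(4096, 4096, device="cuda")

# Queue a chunk of asynchronous GPU work (a stand-in for whatever ran before the
# reference-model forward, e.g. the student forward).
for _ in range(20):
    x = torch.tanh(x @ x)

# Naive timing of the "next" section: .item() forces a synchronization, so this
# timer also pays for all the work queued above.
start = time.perf_counter()
naive = (x @ x).sum().item()
naive_ms = (time.perf_counter() - start) * 1000

# Synchronized timing: drain the queue first, then time only the section of interest.
torch.cuda.synchronize()
start = time.perf_counter()
synced = (x @ x).sum()
torch.cuda.synchronize()
synced_ms = (time.perf_counter() - start) * 1000

print(f"naive: {naive_ms:.1f} ms, synchronized: {synced_ms:.1f} ms")
```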
Hi, I need more information to look into this. Is this with #347? Does the same problem happen in main? What is the config being run?
The debug logs are missing CUDA synchronization; that could be part of the problem.
Yes, this is with #347. Main currently does not support running tensor-parallel distillation. I will check the CUDA synchronization.
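In case it helps, here is a hedged sketch of what a synchronized timer around the reference-model forward could look like; the helper and the usage shown are hypothetical, not part of Fast-LLM. CUDA events measure device time for the wrapped region only, so the result is not affected by kernels still queued from earlier sections.

```python
# Sketch of a synchronized timing helper (hypothetical, not Fast-LLM code).
from contextlib import contextmanager

import torch


@contextmanager
def cuda_timer(label: str):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()  # don't inherit work queued before this region
    start.record()
    yield
    end.record()
    torch.cuda.synchronize()  # wait for the region's kernels to finish
    print(f"{label}: {start.elapsed_time(end):.1f} ms")


# Hypothetical usage around the reference-model forward:
# with cuda_timer("reference model forward"):
#     reference_logits = reference_model(batch)
```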