Reference-model is slow on long sequences, especially with TP>1
Here are some latency numbers, measured at https://github.com/ServiceNow/Fast-LLM/tree/874cb2a875a439cff10d18a67b293ed59831ce4e by timing the reference model's forward pass (https://github.com/ServiceNow/Fast-LLM/blob/874cb2a875a439cff10d18a67b293ed59831ce4e/fast_llm/models/gpt/model.py#L336-L342).
With TP=2 and MBS=1, the reference-model forward takes much longer than with TP=1. Another puzzling point is that the reference-model forward time is larger with TP=2, MBS=1 than with TP=2, MBS=2. Could this be an issue with how the forward time is measured here? (See the timing sketch after the table.)
| Seq-length | TP | MBS | BS | Sequential micro-batches | Teacher size | Student size | Ref-model forward (ms) | Step time (ms) |
|---|---|---|---|---|---|---|---|---|
| 2048 | 1 | 1 | 16 | 1 | 4.6B | 4.6B | 24 | 206 |
| 4096 | 1 | 1 | 16 | 1 | 4.6B | 4.6B | 46 | 333 |
| 8192 | 1 | 1 | 16 | 1 | 4.6B | 4.6B | 95 | 656 |
| 2048 | 2 | 1 | 8 | 1 | 4.6B | 4.6B | 80 | 230 |
| 4096 | 2 | 1 | 8 | 1 | 4.6B | 4.6B | 150 | 367 |
| 8192 | 2 | 1 | 8 | 1 | 4.6B | 4.6B | 301 | 655 |
| 2048 | 2 | 2 | 16 | 1 | 4.6B | 4.6B | 33 | 231 |
| 4096 | 2 | 2 | 16 | 1 | 4.6B | 4.6B | 59 | 376 |
| 8192 | 2 | 2 | 16 | 1 | 4.6B | 4.6B | 121 | 739 |
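For context on the measurement question above, here is a minimal, self-contained sketch (not Fast-LLM code, and assuming the forward timer is a plain host-side wall clock around the call): because CUDA kernels are launched asynchronously, an un-synchronized timer around one section can absorb work queued by earlier sections, since the first operation that blocks on the GPU pays for everything still in flight. If that is what is happening here, the reference-model number could plausibly be inflated by work left over from the student step.

```python
# Toy demonstration (not Fast-LLM code): asynchronous CUDA launches make naive
# host-side timing of a single section misleading.
import time

import torch

x = torch.randn(4096, 4096, device="cuda")

# Queue a chunk of asynchronous GPU work (a stand-in for whatever ran before the
# reference-model forward, e.g. the student forward).
for _ in range(20):
    x = torch.tanh(x @ x)

# Naive timing of the "next" section: .item() forces a synchronization, so this
# timer also pays for all the work queued above.
start = time.perf_counter()
naive = (x @ x).sum().item()
naive_ms = (time.perf_counter() - start) * 1000

# Synchronized timing: drain the queue first, then time only the section of interest.
torch.cuda.synchronize()
start = time.perf_counter()
synced = (x @ x).sum()
torch.cuda.synchronize()
synced_ms = (time.perf_counter() - start) * 1000

print(f"naive: {naive_ms:.1f} ms, synchronized: {synced_ms:.1f} ms")
```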
Hi, I need more information to look into this. Is this with #347? Does the same problem happen in main? What is the config being run?
The debug logs are missing CUDA synchronization; that could be part of the problem.
Yes, this is with #347. Main currently does not support running tensor-parallel distillation. I will check the CUDA synchronization.
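In case it helps, here is a hedged sketch of what a synchronized timer around the reference-model forward could look like; the helper and the usage shown are hypothetical, not part of Fast-LLM. CUDA events measure device time for the wrapped region only, so the result is not affected by kernels still queued from earlier sections.

```python
# Sketch of a synchronized timing helper (hypothetical, not Fast-LLM code).
from contextlib import contextmanager

import torch


@contextmanager
def cuda_timer(label: str):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()  # don't inherit work queued before this region
    start.record()
    yield
    end.record()
    torch.cuda.synchronize()  # wait for the region's kernels to finish
    print(f"{label}: {start.elapsed_time(end):.1f} ms")


# Hypothetical usage around the reference-model forward:
# with cuda_timer("reference model forward"):
#     reference_logits = reference_model(batch)
```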