
Activation-level distillation

Open RaymondLi0 opened this issue 3 months ago • 3 comments

✨ Description

Closes #385

TODOs:

  • [ ] TP / sequence-tensor parallel: fix inconsistent gradients between TP=1 and TP=2
  • [ ] Add tests: train with student == teacher and check that all losses and gradients are 0.
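The student == teacher test above can be sketched in plain PyTorch (a minimal hedged sketch with toy modules, not the Fast-LLM test harness): with identical weights, an activation-distillation loss such as MSE between hidden states must be exactly zero, and so must every student gradient.

```python
import torch
import torch.nn as nn

# Toy stand-ins for student and teacher; copy the teacher's weights so that
# student == teacher bit-for-bit.
torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
student = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
student.load_state_dict(teacher.state_dict())

x = torch.randn(4, 16)
with torch.no_grad():
    target = teacher(x)  # teacher activations, no gradient tracking

# Activation distillation loss: identical weights and inputs give an exact 0.
loss = nn.functional.mse_loss(student(x), target)
loss.backward()

assert loss.item() == 0.0
assert all(p.grad.abs().max().item() == 0.0 for p in student.parameters())
```

Because the two forward passes are the same deterministic float computation, the residual is exactly zero rather than merely small, so the assertions can use strict equality.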

Sanity checks:

  • Loading the student and teacher from the same pretrained model gives 0 loss ✔️, but the loss then increases to a small value instead of staying at 0.
  • Distilling from scratch with the same architecture does not lead to 0 loss (orange).
  • Distilling the pretrained model, but with a sliding window, leads to low loss, lower with a larger window (green and purple). (Screenshot: loss curves, 2025-11-18.)
|                   | lm-loss | Logit distillation | Logit + Activation distillation |
|-------------------|---------|--------------------|---------------------------------|
| Tokens/s/gpu      | 3500    | 2900               | 2800                            |
| max_reserved (GB) | 44      | 77                 | 78                              |
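The two loss variants compared in the table can be sketched as follows (a hedged sketch, not the Fast-LLM implementation; the function name, weighting, and choice of forward KL on logits plus per-layer MSE on activations are assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_acts=None, teacher_acts=None, act_weight=1.0):
    # Logit distillation: forward KL(teacher || student) over the vocabulary.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    if student_acts is not None:
        # Activation distillation: match intermediate hidden states per layer.
        loss = loss + act_weight * sum(
            F.mse_loss(s, t) for s, t in zip(student_acts, teacher_acts)
        )
    return loss
```

The extra activation term explains the throughput and memory gap in the table: the teacher's hidden states at every matched layer must be kept alive alongside the student's, on top of the logits.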

One caveat: distillation seems to experience memory spikes at specific points in training; actual usage was lower most of the time:

(Screenshot: GPU memory usage over training, 2025-11-18.)

🔍 Type of change

Select all that apply:

  • [ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • [x] 🚀 New feature (non-breaking change that adds functionality)
  • [ ] ⚠️ Breaking change (a change that could affect existing functionality)
  • [ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • [ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • [ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • [ ] 📝 Documentation change (updates documentation, including new content or typo fixes)
  • [ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

Testing

  • [ ] 🧪 I have added or updated tests to cover my changes.
  • [ ] ✔️ New and existing tests pass locally with my changes.
  • [ ] 🚦 I have tested these changes on GPUs and verified training stability.
  • [ ] 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • [ ] 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • [ ] ✅ The benchmarks show no performance regression.
  • [ ] 🚀 The benchmarks indicate a potential performance improvement.
  • [ ] ⚠️ The benchmarks indicate a potential performance degradation.
  • [ ] 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:

RaymondLi0 avatar Nov 14 '25 20:11 RaymondLi0

great progress! did you freeze everything except the randomly initialized mixers?

tscholak avatar Nov 19 '25 03:11 tscholak

Another sanity check: Student is initialized from the teacher, but with randomly initialized attention layers. Then we distill activations while freezing the MLPs.

The loss (grey) is still quite far from 0, similar to the orange run where the student is initialized completely from scratch. (Screenshot: loss curves, 2025-11-21.)
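The setup described above can be sketched in generic PyTorch (a hedged sketch, not the actual Fast-LLM API; `Block`, `prepare_student`, and the `attn`/`mlp` names are hypothetical stand-ins): re-initialize the attention modules and freeze every other parameter so that only the attention layers receive gradients during activation distillation.

```python
import torch.nn as nn

def prepare_student(student: nn.Module, attn_name: str = "attn") -> None:
    # Randomly re-initialize the attention blocks, keeping all other weights
    # (which were loaded from the teacher) untouched.
    for name, module in student.named_modules():
        if name.endswith(attn_name):
            for sub in module.modules():
                if hasattr(sub, "reset_parameters"):
                    sub.reset_parameters()
    # Freeze everything except the attention parameters (MLPs, norms, etc.).
    for name, param in student.named_parameters():
        param.requires_grad_(attn_name in name)

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(8, 8)  # stand-in for the attention mixer
        self.mlp = nn.Linear(8, 8)   # stand-in for the MLP

model = nn.Sequential(Block(), Block())
prepare_student(model)
```

Freezing via `requires_grad_` keeps the frozen weights in the forward pass while excluding them from the backward pass and optimizer updates, which is what isolates the distillation signal to the re-initialized attention layers.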

RaymondLi0 avatar Nov 21 '25 20:11 RaymondLi0

Resetting and distilling only one layer, while freezing the rest of the model, gives satisfactory results:

(Screenshot: loss curves, 2025-11-26.)

Note: some changes were required to allow loading a pretrained model while freezing certain layers (#394).

RaymondLi0 avatar Nov 26 '25 22:11 RaymondLi0

Now getting similar loss curves with TP=1, TP=2, and STP=2. (Screenshot: loss curves, 2025-12-02.)

RaymondLi0 avatar Dec 02 '25 20:12 RaymondLi0

Thank you for the reviews! The comments are addressed; could you have another look? @jlamypoirier

RaymondLi0 avatar Dec 03 '25 22:12 RaymondLi0