RaymondLi0
# 🎯 **Goal (What & Why)**

Knowledge distillation was added in #229, but it currently disables the standard LM loss. Enabling knowledge distillation and the standard LM loss would allow...
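Combining the two objectives typically means a weighted sum of the cross-entropy LM loss and a KL-based distillation loss. A minimal sketch, assuming hypothetical names (`combined_loss`, `alpha`) — this is not Fast-LLM's actual API, just the standard formulation:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def combined_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Weighted sum of the standard LM (cross-entropy) loss and a
    KL-divergence distillation loss. alpha=1.0 recovers pure LM
    training; alpha=0.0 recovers pure distillation."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    lm_loss = -math.log(p_s[target])  # cross-entropy vs. the hard label
    kd_loss = sum(t * math.log(t / s) for t, s in zip(p_t, p_s))  # KL(teacher || student)
    return alpha * lm_loss + (1 - alpha) * kd_loss
```

In practice the distillation term often uses temperature-scaled logits as well; the sketch omits that for brevity.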
# ✨ Description

For visibility only. There is no intention to merge this soon.

Specify different lr-scales per layer.

## 🔍 Type of change

Select all that apply:
- [...
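Per-layer lr-scales usually come down to grouping parameters by layer and multiplying the base learning rate by a per-layer factor. A minimal sketch under assumed conventions — parameter names of the form `layers.<i>.<...>` (as seen in the checkpoint error messages above) and a hypothetical `build_param_groups` helper, not the actual implementation in this PR:

```python
def build_param_groups(named_params, base_lr, lr_scales):
    """Group parameters by learning-rate scale.

    named_params: iterable of (name, param) pairs.
    lr_scales: maps a layer index to a multiplier; unlisted layers use 1.0.
    Returns optimizer-style param groups, each with its scaled lr.
    """
    groups = {}
    for name, param in named_params:
        parts = name.split(".")
        # Extract the layer index from names like "layers.8.norm_1.weight".
        layer = int(parts[1]) if len(parts) > 1 and parts[0] == "layers" and parts[1].isdigit() else None
        scale = lr_scales.get(layer, 1.0)
        groups.setdefault(scale, []).append(param)
    return [{"params": ps, "lr": base_lr * scale} for scale, ps in sorted(groups.items())]
```

The resulting list has the same shape as the per-parameter-group options accepted by common optimizers (e.g. PyTorch's), so it can be passed straight to an optimizer constructor.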
# 🐞 Describe the Bug

Conversion fails when using `layers_per_step` together with `input_format=fast_llm`.

Example job: `7ada4a96-4b5d-43de-a156-ebea5f359a33`

```
Global counter mismatch for parameter "layers.8.norm_1.weight" and shard "weights": 0 != 2048 [...]
```
...
# ✨ Description

Closes #385

TODOs:
- [ ] TP / sequence-tensor parallel: inconsistent gradients between TP=1 and TP=2
- [ ] Add tests: train with student==teacher, check that all...
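The student==teacher test idea rests on a simple property: the KL distillation loss between identical distributions is exactly zero, so such a run should contribute no distillation gradient. A minimal sketch of that sanity check (hypothetical `kd_loss` helper, not the repo's test code):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(student_logits, teacher_logits):
    # Forward KL divergence between teacher and student distributions.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Sanity check: with student == teacher, the distillation loss is exactly 0,
# so training should reduce to whatever the other loss terms contribute.
logits = [1.3, -0.2, 0.7]
assert abs(kd_loss(logits, logits)) < 1e-12
```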
# 🐞 Describe the Bug

I'm trying to freeze specific layers of a pretrained model (for example, only layer 0). The problem is that loading a pretrained model like Apriel-Thinker...
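For context, freezing a layer is usually done by matching parameter names and clearing their gradient flag. A minimal sketch, assuming names of the form `layers.<i>.<...>` and a dict with a `requires_grad` key standing in for the framework's parameter object (the hypothetical `freeze_layers` helper is not Fast-LLM's API):

```python
def freeze_layers(named_params, layers_to_freeze):
    """Mark all parameters belonging to the given layer indices as frozen.

    named_params: iterable of (name, param) pairs, where param is a
    stand-in object with a "requires_grad" flag.
    """
    for name, param in named_params:
        parts = name.split(".")
        if len(parts) > 1 and parts[0] == "layers" and parts[1].isdigit():
            if int(parts[1]) in layers_to_freeze:
                param["requires_grad"] = False  # excluded from optimizer updates
    return named_params
```

The issue above is about this interacting badly with loading a pretrained checkpoint, which the sketch does not touch.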
# ✨ Description

For tracking: Hybrid-SSM dev branch

## Outstanding issues

- Missing preprocessing when flash-attn is disabled for the vision encoder -> `KeyError: 'image_encoder_attention_mask'` (this is an issue when running the tests)...
# 🎯 **Goal (What & Why)**

Add activation-level distillation, which usually leads to better student performance.

# 🚀 **Execution Plan**

### **Step 1: What is the smallest working version?**

- Distill...
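Activation-level distillation matches the student's intermediate activations to the teacher's, commonly with a mean-squared-error term summed over layers. A minimal sketch using plain lists in place of tensors (the `activation_distillation_loss` name is hypothetical, not this PR's implementation):

```python
def activation_distillation_loss(student_acts, teacher_acts):
    """Mean-squared error between matched intermediate activations.

    student_acts / teacher_acts: lists of per-layer activation vectors
    (plain lists of floats here; tensors in a real implementation, and
    possibly projected first if hidden sizes differ).
    """
    total, count = 0.0, 0
    for s_layer, t_layer in zip(student_acts, teacher_acts):
        for s, t in zip(s_layer, t_layer):
            total += (s - t) ** 2
            count += 1
    return total / count
```

This term would be added to the logit-level distillation and/or LM loss with its own weight.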
Here are some latency numbers, run with https://github.com/ServiceNow/Fast-LLM/tree/874cb2a875a439cff10d18a67b293ed59831ce4e, measured as the time taken to run the reference model's forward pass (https://github.com/ServiceNow/Fast-LLM/blob/874cb2a875a439cff10d18a67b293ed59831ce4e/fast_llm/models/gpt/model.py#L336-L342). With TP=2, mbs=1, the time taken to run the reference model...
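A typical way to collect such numbers is to time the forward call directly, with a few warmup iterations first. A minimal sketch (the `time_forward` helper is hypothetical; on GPU you would additionally need to synchronize the device before reading the clock):

```python
import time

def time_forward(forward_fn, n_warmup=3, n_iters=10):
    """Average wall-clock time of a forward pass, in seconds.

    forward_fn stands in for the reference model's forward call.
    Warmup iterations are excluded to avoid counting one-time setup costs.
    """
    for _ in range(n_warmup):
        forward_fn()
    start = time.perf_counter()
    for _ in range(n_iters):
        forward_fn()
    return (time.perf_counter() - start) / n_iters
```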
Add a test for the script added in #284.