RaymondLi0
# 🎯 **Goal (What & Why)**

Knowledge distillation was added in #229, but it currently disables the standard LM loss. Enabling knowledge distillation and the standard LM loss would allow...
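Combining the two objectives typically means a weighted sum of the cross-entropy LM loss and a KL-based distillation loss. A minimal sketch, assuming hypothetical names (`combined_loss`, `alpha`) — this is not Fast-LLM's actual API, just the standard formulation:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def combined_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Weighted sum of the standard LM (cross-entropy) loss and a
    KL-divergence distillation loss. alpha=1.0 recovers pure LM
    training; alpha=0.0 recovers pure distillation."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    lm_loss = -math.log(p_s[target])  # cross-entropy vs. the hard label
    kd_loss = sum(t * math.log(t / s) for t, s in zip(p_t, p_s))  # KL(teacher || student)
    return alpha * lm_loss + (1 - alpha) * kd_loss
```

In practice the distillation term often uses temperature-scaled logits as well; the sketch omits that for brevity.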
# ✨ Description

For visibility only. There is no intention to merge this soon.

Specify different lr-scales per layer.

## 🔍 Type of change

Select all that apply:
- [...
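Per-layer lr-scales usually come down to grouping parameters by layer and multiplying the base learning rate by a per-layer factor. A minimal sketch under assumed conventions — parameter names of the form `layers.<i>.<...>` (as seen in the checkpoint error messages above) and a hypothetical `build_param_groups` helper, not the actual implementation in this PR:

```python
def build_param_groups(named_params, base_lr, lr_scales):
    """Group parameters by learning-rate scale.

    named_params: iterable of (name, param) pairs.
    lr_scales: maps a layer index to a multiplier; unlisted layers use 1.0.
    Returns optimizer-style param groups, each with its scaled lr.
    """
    groups = {}
    for name, param in named_params:
        parts = name.split(".")
        # Extract the layer index from names like "layers.8.norm_1.weight".
        layer = int(parts[1]) if len(parts) > 1 and parts[0] == "layers" and parts[1].isdigit() else None
        scale = lr_scales.get(layer, 1.0)
        groups.setdefault(scale, []).append(param)
    return [{"params": ps, "lr": base_lr * scale} for scale, ps in sorted(groups.items())]
```

The resulting list has the same shape as the per-parameter-group options accepted by common optimizers (e.g. PyTorch's), so it can be passed straight to an optimizer constructor.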
# 🐞 Describe the Bug

Conversion fails when using `layers_per_step` together with `input_format=fast_llm`.

Example job: `7ada4a96-4b5d-43de-a156-ebea5f359a33`

```
Global counter mismatch for parameter "layers.8.norm_1.weight" and shard "weights": 0 != 2048 [...]
```
...
# ✨ Description

Closes #385

TODOs:
- [ ] TP / sequence-tensor parallel: inconsistent gradients between TP=1 and TP=2
- [ ] Add tests: train with student==teacher, check that all...
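The student==teacher test idea rests on a simple property: the KL distillation loss between identical distributions is exactly zero, so such a run should contribute no distillation gradient. A minimal sketch of that sanity check (hypothetical `kd_loss` helper, not the repo's test code):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(student_logits, teacher_logits):
    # Forward KL divergence between teacher and student distributions.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Sanity check: with student == teacher, the distillation loss is exactly 0,
# so training should reduce to whatever the other loss terms contribute.
logits = [1.3, -0.2, 0.7]
assert abs(kd_loss(logits, logits)) < 1e-12
```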
# 🐞 Describe the Bug

I'm trying to freeze specific layers of a pretrained model (for example, only layer 0). The problem is that loading a pretrained model like Apriel-Thinker...
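For context, freezing a layer is usually done by matching parameter names and clearing their gradient flag. A minimal sketch, assuming names of the form `layers.<i>.<...>` and a dict with a `requires_grad` key standing in for the framework's parameter object (the hypothetical `freeze_layers` helper is not Fast-LLM's API):

```python
def freeze_layers(named_params, layers_to_freeze):
    """Mark all parameters belonging to the given layer indices as frozen.

    named_params: iterable of (name, param) pairs, where param is a
    stand-in object with a "requires_grad" flag.
    """
    for name, param in named_params:
        parts = name.split(".")
        if len(parts) > 1 and parts[0] == "layers" and parts[1].isdigit():
            if int(parts[1]) in layers_to_freeze:
                param["requires_grad"] = False  # excluded from optimizer updates
    return named_params
```

The issue above is about this interacting badly with loading a pretrained checkpoint, which the sketch does not touch.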
# ✨ Description

For tracking: Hybrid-SSM dev branch

## Outstanding issues

- Missing preprocessing when flash-attn is disabled for the vision encoder -> `KeyError: 'image_encoder_attention_mask'` (this is an issue when running the tests)...
# 🎯 **Goal (What & Why)**

Add activation-level distillation, which usually leads to better student performance.

# 🚀 **Execution Plan**

### **Step 1: What is the smallest working version?**

- Distill...
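Activation-level distillation matches the student's intermediate activations to the teacher's, commonly with a mean-squared-error term summed over layers. A minimal sketch using plain lists in place of tensors (the `activation_distillation_loss` name is hypothetical, not this PR's implementation):

```python
def activation_distillation_loss(student_acts, teacher_acts):
    """Mean-squared error between matched intermediate activations.

    student_acts / teacher_acts: lists of per-layer activation vectors
    (plain lists of floats here; tensors in a real implementation, and
    possibly projected first if hidden sizes differ).
    """
    total, count = 0.0, 0
    for s_layer, t_layer in zip(student_acts, teacher_acts):
        for s, t in zip(s_layer, t_layer):
            total += (s - t) ** 2
            count += 1
    return total / count
```

This term would be added to the logit-level distillation and/or LM loss with its own weight.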
Here are some latency numbers, run with https://github.com/ServiceNow/Fast-LLM/tree/874cb2a875a439cff10d18a67b293ed59831ce4e, measured as the time taken to run the reference model's forward pass (https://github.com/ServiceNow/Fast-LLM/blob/874cb2a875a439cff10d18a67b293ed59831ce4e/fast_llm/models/gpt/model.py#L336-L342). With TP=2, mbs=1, the time taken to run the reference model...
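A typical way to collect such numbers is to time the forward call directly, with a few warmup iterations first. A minimal sketch (the `time_forward` helper is hypothetical; on GPU you would additionally need to synchronize the device before reading the clock):

```python
import time

def time_forward(forward_fn, n_warmup=3, n_iters=10):
    """Average wall-clock time of a forward pass, in seconds.

    forward_fn stands in for the reference model's forward call.
    Warmup iterations are excluded to avoid counting one-time setup costs.
    """
    for _ in range(n_warmup):
        forward_fn()
    start = time.perf_counter()
    for _ in range(n_iters):
        forward_fn()
    return (time.perf_counter() - start) / n_iters
```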
Add a test for the script added in #284.