RaymondLi0
VLM evals:
- https://github.com/open-compass/VLMEvalKit
- https://github.com/ServiceNow/stardoc
Yes, this is with #347. Main does not currently support running tensor-parallel distillation. Will check on the CUDA synchronization.
Even after patching the creeping type parameters (https://github.com/ServiceNow/Fast-LLM/commit/f7a0837d5ba134a3941d2599dfc174b3eb3ef62f), loading the pretrained model into a `PatternBlockSequence` fails:

```
...
File "/home/toolkit/code/Fast-LLM/fast_llm/engine/multi_stage/fast_llm_model.py", line 38, in load_checkpoint
    converter = config.format.get_handler_class()(self)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/toolkit/code/Fast-LLM/fast_llm/engine/checkpoint/huggingface.py",...
```
The above can be fixed by adding support for compatible pattern-block-sequences in the Llama conversion. See: https://github.com/ServiceNow/Fast-LLM/pull/388/commits/6f2d5e3070f3d4188dd4aab6c63d194d1da916c1 and https://github.com/ServiceNow/Fast-LLM/pull/388/commits/52517190ecf61f16dedc8e68c7d305af2beece74
Another sanity check: the student is initialized from the teacher, but with randomly initialized attention layers. We then distill activations while keeping the MLPs frozen. The loss (grey) is still quite far...
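For reference, here is a minimal sketch of what this setup amounts to, in plain PyTorch rather than Fast-LLM's actual API (the `Block` module and `activation_distillation_step` helper are illustrative names, not code from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Toy transformer block: a stand-in 'attention' and 'MLP' sub-layer."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.Linear(d, d)  # stand-in for the attention layer
        self.mlp = nn.Linear(d, d)   # stand-in for the MLP

    def forward(self, x):
        return self.mlp(self.attn(x))

def activation_distillation_step(student, teacher, x):
    # Freeze the MLPs: only the (randomly initialized) attention
    # parameters receive gradients.
    for block in student:
        block.mlp.requires_grad_(False)

    # Teacher activations are targets; no gradients flow through them.
    with torch.no_grad():
        t = x
        teacher_acts = []
        for block in teacher:
            t = block(t)
            teacher_acts.append(t)

    # Match the student's per-block activations to the teacher's.
    s = x
    loss = 0.0
    for block, target in zip(student, teacher_acts):
        s = block(s)
        loss = loss + F.mse_loss(s, target)
    loss.backward()
    return loss.item()
```

Frozen parameters still propagate gradients through themselves, so the attention layers upstream of each frozen MLP are trained normally.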
Resetting and distilling only one layer while freezing the rest of the model gives satisfactory results. Note that some changes were required to allow loading a pretrained model while freezing certain layers...
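The "reset one layer, freeze the rest" setup can be sketched as follows in plain PyTorch (again illustrative, not Fast-LLM's actual loading/freezing mechanism; `reset_and_freeze` is a hypothetical helper):

```python
import torch
import torch.nn as nn

def reset_and_freeze(layers, trainable_index):
    """Re-initialize one layer and freeze every other layer.

    `layers` is an nn.ModuleList (e.g. loaded from a pretrained
    checkpoint); only the layer at `trainable_index` is randomly
    re-initialized and left trainable.
    """
    for i, layer in enumerate(layers):
        if i == trainable_index:
            # Stand-in for random re-initialization of this layer.
            for p in layer.parameters():
                nn.init.normal_(p, std=0.02)
            layer.requires_grad_(True)
        else:
            # Pretrained weights are kept but excluded from training.
            layer.requires_grad_(False)
    return layers
```

With this, the optimizer only updates the reset layer, which is then distilled against the teacher as above.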
Now getting similar loss curves with TP=1, TP=2, and STP=2.
Thank you for the reviews! The comments are addressed, could you have another look? @jlamypoirier