RaymondLi0
VLM evals:
- https://github.com/open-compass/VLMEvalKit
- https://github.com/ServiceNow/stardoc
Yes, this is with #347. Main does not currently support running tensor-parallel distillation. Will check on the CUDA synchronization.
Even after patching the creeping type parameters (https://github.com/ServiceNow/Fast-LLM/commit/f7a0837d5ba134a3941d2599dfc174b3eb3ef62f), loading the pretrained model into a `PatternBlockSequence` fails:

```
...
File "/home/toolkit/code/Fast-LLM/fast_llm/engine/multi_stage/fast_llm_model.py", line 38, in load_checkpoint
    converter = config.format.get_handler_class()(self)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/toolkit/code/Fast-LLM/fast_llm/engine/checkpoint/huggingface.py",...
```
The above can be fixed by adding support for compatible pattern-block-sequences in the Llama conversion. See: https://github.com/ServiceNow/Fast-LLM/pull/388/commits/6f2d5e3070f3d4188dd4aab6c63d194d1da916c1 and https://github.com/ServiceNow/Fast-LLM/pull/388/commits/52517190ecf61f16dedc8e68c7d305af2beece74
Another sanity check: the student is initialized from the teacher, but with randomly initialized attention layers. We then distill activations while keeping the MLPs frozen. The loss (grey) is still quite far...
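For reference, here is a minimal sketch of what this setup amounts to, in plain PyTorch rather than Fast-LLM's actual API (the `Block` module and `activation_distillation_step` helper are illustrative names, not code from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Toy transformer block: a stand-in 'attention' and 'MLP' sub-layer."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.Linear(d, d)  # stand-in for the attention layer
        self.mlp = nn.Linear(d, d)   # stand-in for the MLP

    def forward(self, x):
        return self.mlp(self.attn(x))

def activation_distillation_step(student, teacher, x):
    # Freeze the MLPs: only the (randomly initialized) attention
    # parameters receive gradients.
    for block in student:
        block.mlp.requires_grad_(False)

    # Teacher activations are targets; no gradients flow through them.
    with torch.no_grad():
        t = x
        teacher_acts = []
        for block in teacher:
            t = block(t)
            teacher_acts.append(t)

    # Match the student's per-block activations to the teacher's.
    s = x
    loss = 0.0
    for block, target in zip(student, teacher_acts):
        s = block(s)
        loss = loss + F.mse_loss(s, target)
    loss.backward()
    return loss.item()
```

Frozen parameters still propagate gradients through themselves, so the attention layers upstream of each frozen MLP are trained normally.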
Resetting and distilling only one layer while freezing the rest of the model gives satisfactory results. Note that some changes were required to allow loading a pretrained model while freezing certain layers...
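The "reset one layer, freeze the rest" setup can be sketched as follows in plain PyTorch (again illustrative, not Fast-LLM's actual loading/freezing mechanism; `reset_and_freeze` is a hypothetical helper):

```python
import torch
import torch.nn as nn

def reset_and_freeze(layers, trainable_index):
    """Re-initialize one layer and freeze every other layer.

    `layers` is an nn.ModuleList (e.g. loaded from a pretrained
    checkpoint); only the layer at `trainable_index` is randomly
    re-initialized and left trainable.
    """
    for i, layer in enumerate(layers):
        if i == trainable_index:
            # Stand-in for random re-initialization of this layer.
            for p in layer.parameters():
                nn.init.normal_(p, std=0.02)
            layer.requires_grad_(True)
        else:
            # Pretrained weights are kept but excluded from training.
            layer.requires_grad_(False)
    return layers
```

With this, the optimizer only updates the reset layer, which is then distilled against the teacher as above.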
Now getting similar loss curves with TP=1, TP=2, and STP=2.
Thank you for the reviews! The comments are addressed, could you have another look? @jlamypoirier