Shivam Sahni
Hi, you should see the perf gains during training (fwd + bwd).
Worked on #628 ... I didn't see this issue before, so I was working on this independently. In my implementation: 1. KL is only calculated if beta is non-zero. Assume user directly...
@matthewdouglas Would really appreciate any tips here. TIA!
> Implementing the chunked JSD loss function for FSDP-enabled training. The approach is similar to LigerORPOTrainer, where we pass the models' lm_head weights and last hidden states to Liger{ORPO/JSD}Loss which...
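
For readers unfamiliar with the idea in the quote above, here is a minimal pure-Python sketch of a chunked JSD loss computed from last hidden states and `lm_head` weights. It is not the Liger kernel API and does not deal with FSDP. The function names, the list-of-lists layout, and the generalized-JSD form `beta * KL(P||M) + (1 - beta) * KL(Q||M)` with `M = beta * P + (1 - beta) * Q` are all assumptions made for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def project(hidden, weight):
    # hidden: (rows, dim), weight: (vocab, dim) -> logits: (rows, vocab)
    return [[sum(h * w for h, w in zip(h_row, w_row)) for w_row in weight]
            for h_row in hidden]

def kl(p, q):
    # KL(p || q), skipping zero-probability entries of p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd_row(p, q, beta):
    # Assumed generalized JSD between two distributions, 0 < beta < 1.
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)

def chunked_jsd_loss(student_hidden, student_weight,
                     teacher_hidden, teacher_weight,
                     chunk_size, beta=0.5):
    # Hypothetical helper: only one chunk of logits is materialized at a
    # time instead of the full (rows, vocab) matrix.
    if not 0.0 < beta < 1.0:
        raise ValueError("this sketch only handles 0 < beta < 1")
    n = len(student_hidden)
    total = 0.0
    for start in range(0, n, chunk_size):
        s_logits = project(student_hidden[start:start + chunk_size], student_weight)
        t_logits = project(teacher_hidden[start:start + chunk_size], teacher_weight)
        for s_row, t_row in zip(s_logits, t_logits):
            total += jsd_row(softmax(s_row), softmax(t_row), beta)
    return total / n  # mean over rows (tokens)

# Toy inputs: 5 tokens, hidden size 3, vocabulary of 4.
student_h = [[0.1 * (i + j) for j in range(3)] for i in range(5)]
teacher_h = [[0.2 * (i - j) for j in range(3)] for i in range(5)]
student_w = [[0.3 * (v - d) for d in range(3)] for v in range(4)]
teacher_w = [[0.1 * (v + d) for d in range(3)] for v in range(4)]

loss_chunked = chunked_jsd_loss(student_h, student_w, teacher_h, teacher_w, chunk_size=2)
loss_full = chunked_jsd_loss(student_h, student_w, teacher_h, teacher_w, chunk_size=5)
loss_same = chunked_jsd_loss(student_h, student_w, student_h, student_w, chunk_size=2)
```

Chunking only changes how much of the logits matrix exists at once, so `loss_chunked` and `loss_full` should agree, and `loss_same` should be zero because the two distributions are identical.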
#623 @zcnrex thanks for the draft PR. HF released [Deepseek V3 support](https://github.com/huggingface/transformers/tree/main/src/transformers/models/deepseek_v3) recently and we should be able to test this patch now.
Sounds good @hongpeng-guo, a separate base class for distillation is absolutely needed!
https://github.com/triton-lang/triton/issues/5205
Would like to take this up!