Shivam Sahni

Results 8 comments of Shivam Sahni

Hi, you should see the perf gains while training (fwd + bwd)

Worked on #628 ... didn't see this issue before, so I was working on this independently. In my implementation: 1. KL is only calculated if beta is non-zero. Assume user directly...
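To illustrate the "KL only if beta is non-zero" point, here is a minimal sketch in plain Python. It is not the code from #628: the function name `loss_with_optional_kl`, the `base_loss + beta * KL` form of the objective, and the choice of KL estimator are all assumptions made for illustration.

```python
import math


def loss_with_optional_kl(base_loss, logp=None, ref_logp=None, beta=0.0):
    """Hypothetical helper: add a beta-weighted KL penalty only when beta != 0.

    base_loss : scalar loss without the KL term (assumed to be computed elsewhere)
    logp      : per-token log-probs under the current policy
    ref_logp  : per-token log-probs under the reference model
    """
    # With beta == 0 the KL term contributes nothing, so skip it entirely.
    # The reference log-probs are then not needed at all.
    if beta == 0.0:
        return base_loss

    if logp is None or ref_logp is None:
        raise ValueError("logp and ref_logp are required when beta is non-zero")

    # Assumed estimator: exp(d) - d - 1 with d = ref_logp - logp,
    # which is non-negative for every token, averaged over tokens.
    kl = sum(
        math.exp(r - p) - (r - p) - 1.0 for p, r in zip(logp, ref_logp)
    ) / len(logp)
    return base_loss + beta * kl
```

In this sketch the practical benefit is that a caller running with `beta=0` never has to produce reference log-probs, so the reference forward pass can be skipped.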

> Implementing the chunked JSD loss function for FSDP-enabled training

The approach is similar to LigerORPOTrainer, where we pass the models' lm_head weights and last hidden states to Liger{ORPO/JSD}Loss which...
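A rough sketch of the chunking idea, in plain Python rather than the actual Liger kernels: the loss receives the last hidden states and the lm_head weights, and only materializes logits for one chunk of rows at a time. The name `chunked_jsd`, its signature, and the use of the standard symmetric JSD (mixture weight 0.5) are assumptions for illustration. FSDP handling is not modeled here.

```python
import math


def _matvec(weight, h):
    # weight: [vocab][hidden], h: [hidden] -> logits: [vocab]
    return [sum(w * x for w, x in zip(row, h)) for row in weight]


def _softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def _kl(p, q):
    # KL(p || q); terms with p_i == 0 contribute nothing
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def chunked_jsd(student_hidden, student_weight,
                teacher_hidden, teacher_weight, chunk_size=2):
    """Mean JSD over rows, projecting hidden states to logits chunk by chunk."""
    n = len(student_hidden)
    total = 0.0
    for start in range(0, n, chunk_size):
        s_chunk = student_hidden[start:start + chunk_size]
        t_chunk = teacher_hidden[start:start + chunk_size]
        # Logits exist only for the rows in the current chunk.
        for hs, ht in zip(s_chunk, t_chunk):
            p = _softmax(_matvec(student_weight, hs))
            q = _softmax(_matvec(teacher_weight, ht))
            m = [0.5 * (a + b) for a, b in zip(p, q)]
            total += 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
    return total / n
```

The result does not depend on `chunk_size`; chunking only bounds how many rows of logits are alive at once, which is where the memory saving would come from in a tensor implementation.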

#623 @zcnrex thanks for the draft PR. HF released [Deepseek V3 support](https://github.com/huggingface/transformers/tree/main/src/transformers/models/deepseek_v3) recently and we should be able to test this patch now.

Sounds good @hongpeng-guo, a separate base class for distillation is absolutely needed!