Shivam Sahni comments

Results: 8 comments of Shivam Sahni

No Significant Improvement Observed in Model Training Speed

Hi, you should see the perf gains while training (fwd + bwd)

Generalized PPO loss (& improve current GRPO loss)

Worked on #628 ... didn't see this issue before, so was working on this independently. In my implementation: 1. KL is only calculated if beta is non-zero. Assume user directly...
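Point 1 of the comment above (computing the KL term only when beta is non-zero) can be illustrated with a minimal per-token sketch. This is not the code from #628: the function name `grpo_token_loss`, its parameters, the clipping range `eps`, and the choice of the k3 KL estimator are all assumptions made for illustration, and the rest of the comment is truncated, so nothing beyond point 1 is covered.

```python
import math


def grpo_token_loss(logp, old_logp, ref_logp, advantage, beta=0.0, eps=0.2):
    """Hypothetical per-token clipped policy loss with an optional KL penalty."""
    # Importance ratio between the current and the old policy.
    ratio = math.exp(logp - old_logp)
    # Clip the ratio to [1 - eps, 1 + eps], PPO style.
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    policy_loss = -min(ratio * advantage, clipped * advantage)

    if beta == 0.0:
        # KL is skipped entirely, so ref_logp is never touched
        # and no reference-model forward pass would be needed.
        return policy_loss

    # k3 estimator of KL(policy || reference); always non-negative.
    delta = ref_logp - logp
    kl = math.exp(delta) - delta - 1.0
    return policy_loss + beta * kl
```

With `beta=0.0` the caller can pass `ref_logp=None`, which is the practical benefit of the guard: the reference log-probabilities do not have to be computed at all.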

Paged optimizer resuming from checkpoint - attributeError: 'int' object has no attribute 'cpu'

@matthewdouglas Would really appreciate any tips here. TIA!

GKD trainer + chunked JSD loss + FSDP

> Implementing the chunked JSD loss function for FSDP-enabled training The approach is similar to LigerORPOTrainer where we pass the models' lm_head weights and last hidden states to Liger{ORPO/JSD}Loss which...

[DRAFT] Add deepseek v3 monkey patch

#623 @zcnrex thanks for the draft PR. HF released [Deepseek V3 support](https://github.com/huggingface/transformers/tree/main/src/transformers/models/deepseek_v3) recently and we should be able to test this patch now.

[RFC] Liger FlexChunkLoss: Alignment and Distillation loss

Sounds good @hongpeng-guo, a separate base class for distillation is absolutely needed!

ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)

https://github.com/triton-lang/triton/issues/5205

DeepSeek Native Sparse Attention (NSA) Kernel

Would like to take this up!