Liger-Kernel
Z Loss in CE
🚀 The feature, motivation and pitch
Z loss is often used to stabilize LM pretraining, e.g. in PaLM and the recent Chameleon.
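For reference, here is a minimal unfused sketch of what is being requested, assuming the PaLM-style formulation where an auxiliary term `z_weight * log(Z)^2` is added to the cross-entropy loss (`Z` being the softmax normalizer). The function name, signature, and default weight are hypothetical and for illustration only; this is not Liger-Kernel's or flash-attn's API.

```python
import math

def cross_entropy_with_z_loss(logits, target, z_weight=1e-4):
    """Unfused reference for one row of logits (hypothetical helper).

    Returns (total_loss, z_loss), where total_loss = CE + z_loss and
    z_loss = z_weight * logsumexp(logits) ** 2.
    """
    # Numerically stable log-sum-exp: log(Z) = m + log(sum(exp(x - m)))
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))

    # Standard cross entropy: -log softmax(logits)[target]
    ce = lse - logits[target]

    # Auxiliary z loss pulls log(Z) toward 0
    z_loss = z_weight * lse ** 2
    return ce + z_loss, z_loss

total, z = cross_entropy_with_z_loss([2.0, 1.0, 0.1], target=0)
print(total, z)
```

Since the log-sum-exp is already computed for the cross-entropy term, the z loss should in principle add very little work inside a fused kernel.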
Alternatives
flash-attn implements the above-mentioned features; however, it does not support fusing them with the linear head.
Additional context
No response
Legit ask! Label smoothing is already tracked at https://github.com/linkedin/Liger-Kernel/issues/81. I've changed the title to cover only Z loss to avoid duplication.
@ByronHsu #take To support z loss, I just need a small addition to #198. I'll work on it after the label_smoothing PR is merged.