direct-preference-optimization
How to guarantee that output.logits.shape[:-1] == labels.shape
How can I guarantee that the two shapes are the same?
When I train a custom LLM with DPO, the loss does not converge. Could the reason be that the two shapes are different?
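For illustration, here is a minimal sketch (not from the original question; the model name, prompt, and response are placeholders) of how labels are typically built for a causal LM so that output.logits.shape[:-1] == labels.shape holds: labels are a copy of input_ids, with the prompt positions masked to -100, so they keep the same (batch, seq_len) shape as the inputs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, swap in your custom LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is DPO?"                     # placeholder prompt
chosen = " Direct Preference Optimization."  # placeholder chosen response

# Tokenize the prompt alone and the prompt+response together,
# so the full sequence defines one shared sequence length.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + chosen, return_tensors="pt").input_ids

# Labels are a copy of input_ids with the prompt part masked out (-100),
# so labels has the same (batch, seq_len) shape as input_ids.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

with torch.no_grad():
    output = model(input_ids=full_ids)

# logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
assert output.logits.shape[:-1] == labels.shape
```

If the shapes still disagree in your pipeline, the mismatch usually comes from building labels with a different tokenization or truncation than the input_ids passed to the model.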