direct-preference-optimization

How to guarantee that output.logits.shape[:-1] == labels.shape

Open foreverhell opened this issue 1 year ago • 0 comments

dpo

How can I guarantee that output.logits.shape[:-1] and labels.shape are the same? When I train a custom LLM with DPO, the loss does not converge. Could a mismatch between the two shapes be the reason?
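One way to narrow this down is to check the two shapes explicitly before the loss is computed. The sketch below is only an illustration: it assumes the usual causal-LM layout, with logits of shape (batch, seq_len, vocab_size) and labels of shape (batch, seq_len), and the helper name `check_logits_labels_shapes` is hypothetical, not part of any library. It works on plain shape tuples, so it can be called with `output.logits.shape` and `labels.shape`.

```python
def check_logits_labels_shapes(logits_shape, labels_shape):
    """Return None if logits_shape[:-1] == labels_shape, else a short explanation.

    Assumes (hypothetically) logits are (batch, seq_len, vocab_size)
    and labels are (batch, seq_len).
    """
    logits_shape = tuple(logits_shape)
    labels_shape = tuple(labels_shape)

    # Logits carry one extra trailing dimension (the vocabulary axis).
    if len(logits_shape) != len(labels_shape) + 1:
        return (
            f"rank mismatch: logits have {len(logits_shape)} dims, "
            f"labels have {len(labels_shape)} dims"
        )

    # Compare every leading dimension (batch, sequence length, ...).
    if logits_shape[:-1] != labels_shape:
        return (
            f"shape mismatch: logits.shape[:-1] = {logits_shape[:-1]}, "
            f"labels.shape = {labels_shape}"
        )

    return None


if __name__ == "__main__":
    # Matching shapes: batch of 2, sequence length 5, vocabulary of 100.
    print(check_logits_labels_shapes((2, 5, 100), (2, 5)))
    # Mismatched sequence length, e.g. labels padded differently from the inputs.
    print(check_logits_labels_shapes((2, 6, 100), (2, 5)))
```

If the check reports a mismatch, a likely place to look (an assumption, since the custom model is not shown) is whether the labels are built from the same padded `input_ids` that are passed to the model, and whether the custom model adds or drops sequence positions in its forward pass. If the shapes already match, the convergence problem probably has a different cause.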

foreverhell, Aug 13 '24 07:08