trl
Difference from the Online DPO paper
The paper says that the annotator can be adjusted through its prompt, but the trl implementation uses a score instead. Is this different from the paper?
Indeed, the current implementation differs from the paper. We will soon implement Online DPO with a judge (i.e., an LLM annotator), and the PR will be linked to this issue.
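To make the distinction concrete, here is a minimal sketch in plain Python. It does not use trl's API: `pair_by_score`, `pair_by_judge`, `toy_score`, and `toy_judge` are hypothetical names made up for illustration, and the toy functions merely stand in for a scoring model and an LLM annotator. The assumption is that the score-based approach rates each completion independently, while a judge sees the prompt and both completions together, which is why its behavior can be steered by changing the prompt it receives.

```python
from typing import Callable, Sequence, Tuple


def pair_by_score(
    completions: Sequence[str],
    score_fn: Callable[[str], float],
) -> Tuple[str, str]:
    """Score-based: rate each completion on its own, higher score is 'chosen'."""
    first, second = completions
    if score_fn(first) >= score_fn(second):
        return first, second
    return second, first


def pair_by_judge(
    prompt: str,
    completions: Sequence[str],
    judge_fn: Callable[[str, Sequence[str]], int],
) -> Tuple[str, str]:
    """Judge-based: the judge compares both completions and returns the index (0 or 1) it prefers."""
    best = judge_fn(prompt, completions)
    return completions[best], completions[1 - best]


# Hypothetical stand-ins, not real models.
def toy_score(text: str) -> float:
    # Pretend longer answers score higher.
    return float(len(text))


def toy_judge(prompt: str, completions: Sequence[str]) -> int:
    # Pretend the judge was instructed (via its prompt) to prefer concise answers.
    return 0 if len(completions[0]) <= len(completions[1]) else 1


candidates = ["short", "a much longer answer"]
chosen_s, rejected_s = pair_by_score(candidates, toy_score)
chosen_j, rejected_j = pair_by_judge("Q?", candidates, toy_judge)
```

With these toy stand-ins the two approaches disagree on which completion is chosen, which shows how the choice of annotator changes the preference pairs that training sees.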