How to generate preference pairs when generate_sequences only outputs one answer per prompt?

Open TianL123 opened this issue 5 months ago • 0 comments

Thanks for your great work on this project! I’ve been exploring the code and I have a question regarding the data generation process in self.actor_rollout_wg.generate_sequences(gen_batch).

From what I can see, this function seems to generate only one output per prompt. However, for preference-based methods like DPO (Direct Preference Optimization), we typically need pairs of outputs (e.g., a “better” and “worse” answer) to form preference pairs for training.
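To make the question concrete, here is a minimal sketch of the kind of pair construction I have in mind. It assumes several responses have already been sampled for one prompt and that each has a scalar score (for example from a reward model or a rule-based checker). The function name `build_preference_pair` and the `prompt`/`chosen`/`rejected` keys are hypothetical and are not part of verl's API.

```python
from typing import Dict, List, Optional


def build_preference_pair(
    prompt: str,
    responses: List[str],
    scores: List[float],
) -> Optional[Dict[str, str]]:
    """Pick the highest- and lowest-scored responses as (chosen, rejected).

    Hypothetical helper for illustration only; not a verl function.
    Returns None when no meaningful pair can be formed.
    """
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    if len(responses) < 2:
        # A single response cannot form a pair.
        return None

    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        # All responses tie, so there is no preference signal.
        return None

    return {
        "prompt": prompt,
        "chosen": responses[best],
        "rejected": responses[worst],
    }


pair = build_preference_pair("2+2?", ["4", "5", "22"], [1.0, 0.0, 0.0])
```

With one response per prompt this helper always returns None, which is exactly the situation I am asking about.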

Could you clarify:

1. Why does generate_sequences generate only one answer per prompt?
2. How do you construct the preference pairs from these single outputs? Are you comparing across batches, sampling additional generations, or relying on some scoring mechanism?
3. If I wanted to modify the code to generate multiple answers per prompt (for explicit pairwise comparison), where would be the best place to adjust?

Thanks a lot for your help! Looking forward to your guidance.

TianL123, May 26 '25 12:05