verl
                        How to generate preference pairs when generate_sequences only outputs one answer per prompt?
Thanks for your great work on this project! I’ve been exploring the code and I have a question regarding the data generation process in self.actor_rollout_wg.generate_sequences(gen_batch).
From what I can see, this function generates only one output per prompt. However, preference-based methods like DPO (Direct Preference Optimization) typically need pairs of outputs (e.g., a "better" and a "worse" answer) to form preference pairs for training.
Could you clarify:
1. Why does generate_sequences generate only one answer per prompt?
2. How do you construct preference pairs from these single outputs? Are you comparing across batches, sampling additional generations, or relying on some scoring mechanism?
3. If I wanted to modify the code to generate multiple answers per prompt (for explicit pairwise comparison), where would be the best place to make that change?
Thanks a lot for your help! Looking forward to your guidance.
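For context, here is a minimal sketch of the kind of pairing I have in mind: sample several completions per prompt, score each one, and take the highest- and lowest-scoring completions as a (chosen, rejected) pair for DPO. The `generate` and `score_fn` callables and the data shapes here are my own assumptions for illustration, not verl's actual API:

```python
from typing import Callable, List, Tuple


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: n completions per prompt
    score_fn: Callable[[str, str], float],      # hypothetical reward/scoring function
    n_samples: int = 4,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples for DPO-style training."""
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt, n_samples)
        # Rank candidates by score; best vs. worst forms the preference pair.
        ranked = sorted(candidates, key=lambda ans: score_fn(prompt, ans))
        chosen, rejected = ranked[-1], ranked[0]
        if chosen != rejected:  # skip degenerate pairs with no preference signal
            pairs.append((prompt, chosen, rejected))
    return pairs


# Toy demo with a deterministic stand-in generator and a length-based "reward"
demo_gen = lambda p, n: [p + "!" * (i + 1) for i in range(n)]
demo_score = lambda p, ans: len(ans)
pairs = build_preference_pairs(["hi"], demo_gen, demo_score)
```

In practice the generator would be whatever sampling call the framework exposes (with temperature > 0 so the n completions differ), and the scorer would be a reward model or verifier.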