verl
Message histories in agentic RL for reasoning models
System Info
MULTI A100 * 8 Nodes
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
Hello!
I have recently been implementing agentic RL for a reasoning model.
And the ideal way for me to call a reasoning model for multi-turn interaction would be
However, in the current "tool_agent_loop", I think the entire response tokens (reasoning + answer) are appended to "prompt_ids", and the same applies to the various training-assisted masks.
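To make the question concrete, here is a toy sketch of the behavior as I understand it. It is not verl's actual code: `ToyRollout`, `append_model_turn`, `append_tool_turn`, and `response_mask` are hypothetical names I made up for illustration, the token ids are dummies, and only "prompt_ids" is taken from the loop itself. It assumes one mask that marks model-generated tokens with 1 and tool tokens with 0.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRollout:
    """Toy model of the accumulation described above (hypothetical, not verl code)."""

    prompt_ids: List[int]
    # One entry per token appended after the initial prompt:
    # 1 = generated by the model, 0 = returned by a tool.
    response_mask: List[int] = field(default_factory=list)

    def append_model_turn(self, reasoning_ids: List[int], answer_ids: List[int]) -> None:
        # Assumption being checked: reasoning AND answer tokens are both kept,
        # so the next turn is conditioned on the earlier reasoning as well.
        new_ids = list(reasoning_ids) + list(answer_ids)
        self.prompt_ids.extend(new_ids)
        self.response_mask.extend([1] * len(new_ids))

    def append_tool_turn(self, tool_ids: List[int]) -> None:
        # Tool output extends the context but is masked out of the loss.
        self.prompt_ids.extend(tool_ids)
        self.response_mask.extend([0] * len(tool_ids))


# Two model turns with one tool call in between (dummy token ids).
rollout = ToyRollout(prompt_ids=[1, 2, 3])
rollout.append_model_turn(reasoning_ids=[10, 11], answer_ids=[20])
rollout.append_tool_turn(tool_ids=[30, 31])
rollout.append_model_turn(reasoning_ids=[12], answer_ids=[21, 22])

print(rollout.prompt_ids)
print(rollout.response_mask)
```

In this sketch the first turn's reasoning tokens (10, 11) are still part of the context when the second turn is generated, which is the behavior I am asking about.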
I just want to check whether I understand this correctly, because I have been stuck on this issue for several days.
Expected behavior
As described above.