verl verl是否支持使用LLM作为交互式环境进行multi-turn rl? | Does verl support multi-turn rl using LLM as an interactive environment?

首先，非常感谢你们开发了 Verl 这个优秀的框架，它为基于LLM的强化学习研究提供了极大的便利。我目前正在探索一个特定的应用场景，想咨询一下Verl框架目前是否支持该功能，或者是否有推荐的实现方法。

我希望通过强化学习来训练一个智能体LLM，以提升其多轮对话能力。我设想的交互环境不是代码解释器或游戏，而是另一个LLM，用它来模拟真实用户，这样来实现和用户的多轮对话来完成用户指定的某些任务。我在multi turn的示例中找到了关于GSM8K下或者带有工具调用的multi turn rl，但我不太确定和另一个llm进行交互是否是被支持的。

如果有相似的issue请指出，issue太多我暂时并没有找到有类似问题的情况，所以这里放一个issue想咨询下，感谢verl社区的同学。

【English】： First of all, thank you very much for developing the excellent Verl framework, which greatly facilitates research in reinforcement learning with LLMs.

I'm currently exploring a specific application scenario and would like to ask if the Verl framework currently supports this functionality or if there are any recommended implementation methods.

I'd like to train an LLM through reinforcement learning to improve its multi-turn conversational capabilities. The interaction environment I envision isn't a code interpreter or game, but rather another LLM that simulates a real user, enabling multi-turn conversations with the user to complete certain user-specified tasks.

I found references to multi-turn RL under GSM8K or with tool calls in the multi-turn examples, but I'm not sure whether interacting with another LLM is supported.

If you have similar issues, please let me know. There are so many issues that I haven't found any similar cases, so I'm posting this here to discuss. Thank you to the Verl community.