Async-GRPO: Decouple rollout generation and reward computation/model update for faster training
Method description
I'm not sure whether this fits better as a feature request or as a new trainer, but I think it would be great if TRL supported async-GRPO. It is not mathematically equivalent to GRPO, since rollouts may be generated with a slightly stale policy, but in practice that lag matters little. It effectively introduces pipeline parallelism into the GRPO workflow: rollout generation, reward computation, and model updates run concurrently instead of in lockstep (see the sketch below).
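To make the decoupling concrete, here is a minimal, purely illustrative Python sketch of the producer/consumer structure: one thread keeps generating rollouts with whatever policy weights it currently has, while the trainer consumes them, computes rewards, and takes optimizer steps, dropping rollouts whose policy version is too stale. All names here (`rollout_queue`, `MAX_POLICY_LAG`, the toy reward) are hypothetical stand-ins, not TRL or NeMo RL APIs.

```python
import queue
import threading
import time

MAX_POLICY_LAG = 2    # drop rollouts older than this many policy versions (assumed bound)
NUM_STEPS = 10

rollout_queue = queue.Queue(maxsize=4)  # bounded buffer between generation and training
policy_version = 0
stop = threading.Event()

def generate_rollouts():
    """Producer: keeps sampling completions with the latest weights it has seen."""
    while not stop.is_set():
        version = policy_version              # snapshot of the (possibly stale) policy
        rollout = f"completion@v{version}"    # stand-in for sampled completions
        time.sleep(0.05)                      # pretend generation takes time
        rollout_queue.put((version, rollout))

def train():
    """Consumer: scores rollouts and updates the policy, skipping stale ones."""
    global policy_version
    steps = 0
    while steps < NUM_STEPS:
        version, rollout = rollout_queue.get()
        if policy_version - version > MAX_POLICY_LAG:
            continue                          # too stale: discard instead of training on it
        reward = len(rollout)                 # stand-in for real reward computation
        policy_version += 1                   # stand-in for a GRPO optimizer step
        steps += 1
        print(f"step {steps}: trained on {rollout} "
              f"(reward={reward}, lag={policy_version - 1 - version})")
    stop.set()

producer = threading.Thread(target=generate_rollouts, daemon=True)
producer.start()
train()
```

In a real implementation the producer would be an inference engine (e.g. a vLLM server) that is periodically refreshed with updated weights, and the staleness bound would cap how far generation is allowed to run ahead of training; the NeMo RL guide linked below describes this kind of trajectory-age limit.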
I looked for related issues. https://github.com/huggingface/trl/issues/4130 is about running reward functions concurrently with respect to each other, but it does not propose decoupling rollout generation from training.
NeMo RL supports this; it is described in the documentation: https://docs.nvidia.com/nemo/rl/latest/guides/async-grpo.html
Based on a brief check, I don't think TRL supports it yet.
Cheers!
Open source status
- [x] The method implementation is available
- [ ] The model weights are available
- [x] The training datasets are available
Provide useful links for the implementation
https://docs.nvidia.com/nemo/rl/latest/guides/async-grpo.html
https://github.com/NVIDIA-NeMo/RL
Indeed, it's not available yet, and at this point I'm not sure whether we will support it. Keeping this issue open in case that changes.