Stepwise Advantages for Multi-Turn Training
Description
This PR doesn't change existing behavior, but it adds new options to `RLConfig`:
- `use_stepwise_advantage` - If True, treat each assistant turn as its own training sample and use a discounted return per step.
- `stepwise_aggregation` - How to compute the discounted per-step return R_t from future rewards (see the sketch after this list).
- `stepwise_gamma` - Discount factor gamma used when crediting rewards from later turns back to previous turns.
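For illustration, here is a minimal sketch of how a discounted per-step return could be computed from per-turn rewards; the helper name and the "discounted sum" aggregation shown here are assumptions for this example, not the PR's actual code:

```python
# Illustrative sketch only; assumes one scalar reward per assistant turn
# and a discounted-sum aggregation over future rewards.
def discounted_stepwise_returns(rewards: list[float], gamma: float) -> list[float]:
    """R_t = sum_{k >= t} gamma^(k - t) * r_k, computed right-to-left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three turns with rewards [0.0, 0.5, 1.0] and gamma = 0.9
# -> [0.0 + 0.9*0.5 + 0.81*1.0, 0.5 + 0.9*1.0, 1.0] = [1.26, 1.4, 1.0]
```

Each R_t then serves as the reward signal for the sample built from turn t, so earlier turns get partial credit for later successes.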
It also adds a new parameter to `MultiTurnEnv` called `exclude_think`, which removes the chain-of-thought (CoT) from previous turns; a rough sketch of the idea follows below. I think this only makes sense when using the stepwise advantage implementation.
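A rough sketch of what stripping CoT from earlier turns might look like when rebuilding the context; the `<think>` tag convention, regex, and message structure here are assumptions about the chat format, not the parameter's actual implementation:

```python
import re

# Assumes CoT is delimited by <think>...</think> in assistant messages.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_cot(messages: list[dict]) -> list[dict]:
    """Drop <think> spans from prior assistant turns so later steps are
    trained without earlier chain-of-thought in their context."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```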
The implementation is based on *Kevin: Multi-Turn RL for Generating CUDA Kernels*.
Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Test improvement
Testing
- [x] All existing tests pass when running `uv run pytest` locally.
- [x] New tests have been added to cover the changes.
Checklist
- [x] My code follows the style guidelines of this project as outlined in AGENTS.md
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] Any dependent changes have been merged and published