
Stepwise Advantages for Multi-Turn Training

Open · kyleavery opened this pull request 2 months ago · 0 comments

Description

This PR doesn't change existing behavior; it adds the following new options to RLConfig:

  • use_stepwise_advantage - If True, treat each assistant turn as its own training sample and use a discounted return per step.
  • stepwise_aggregation - How to compute the discounted per-step return R_t from future rewards.
  • stepwise_gamma - Discount factor gamma applied to future-turn rewards when computing R_t (see the sketch below).

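A minimal sketch of the per-step return that stepwise_gamma controls, assuming one scalar reward per assistant turn (the helper name and signature are illustrative, not taken from this PR):

```python
def discounted_stepwise_returns(rewards: list[float], gamma: float) -> list[float]:
    """Compute R_t = sum_{k >= t} gamma^(k - t) * r_k for each turn t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn's return folds in the discounted future.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, discounted_stepwise_returns([0.0, 0.0, 1.0], gamma=0.9) yields [0.81, 0.9, 1.0]: a final-turn reward is credited back to earlier turns, shrinking by a factor of gamma per step.
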
It also adds a new parameter to MultiTurnEnv, exclude_think, which removes the CoT from previous turns. I think this only makes sense when using the stepwise advantage implementation.
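
For context, here is a sketch of the kind of filtering exclude_think implies, assuming CoT is delimited by <think> tags (the tag convention and the helper are assumptions, not this PR's code):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_prior_cot(messages: list[dict]) -> list[dict]:
    """Drop <think>...</think> spans from earlier assistant turns,
    leaving the most recent message untouched."""
    cleaned = list(messages)
    for i, msg in enumerate(cleaned[:-1]):  # all but the latest message
        if msg.get("role") == "assistant":
            cleaned[i] = {**msg, "content": THINK_RE.sub("", msg["content"]).strip()}
    return cleaned
```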

The implementation is based on Kevin: Multi-Turn RL for Generating CUDA Kernels.
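
A hypothetical end-to-end configuration (the import path, the aggregation value, and the omitted constructor arguments are assumptions; only the three option names and exclude_think come from this PR):

```python
import verifiers as vf  # import path assumed

config = vf.RLConfig(
    use_stepwise_advantage=True,            # one training sample per assistant turn
    stepwise_aggregation="discounted_sum",  # illustrative value, not confirmed
    stepwise_gamma=0.9,                     # discount on future-turn rewards
)
env = vf.MultiTurnEnv(
    # ...other constructor arguments omitted...
    exclude_think=True,  # new parameter: drop CoT from previous turns
)
```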

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Documentation update
  • [ ] Test improvement

Testing

  • [x] All existing tests pass when running uv run pytest locally.
  • [x] New tests have been added to cover the changes.

Checklist

  • [x] My code follows the style guidelines of this project as outlined in AGENTS.md
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [x] Any dependent changes have been merged and published

Additional Notes

kyleavery · Nov 05 '25 19:11