Stepwise Advantages for Multi-Turn Training
Description
This PR doesn't change existing behavior, but it adds new options to `RLConfig`:
- `use_stepwise_advantage` - If True, treat each assistant turn as its own training sample and use a discounted return per step.
- `stepwise_aggregation` - How to compute the discounted per-step return R_t from future rewards (see the sketch after this list).
- `stepwise_gamma` - Discount factor gamma used when crediting rewards from later turns back to previous turns.
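For illustration, here is a minimal sketch of how a discounted per-step return could be computed from per-turn rewards; the helper name and the "discounted sum" aggregation shown here are assumptions for this example, not the PR's actual code:

```python
# Illustrative sketch only; assumes one scalar reward per assistant turn
# and a discounted-sum aggregation over future rewards.
def discounted_stepwise_returns(rewards: list[float], gamma: float) -> list[float]:
    """R_t = sum_{k >= t} gamma^(k - t) * r_k, computed right-to-left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three turns with rewards [0.0, 0.5, 1.0] and gamma = 0.9
# -> [0.0 + 0.9*0.5 + 0.81*1.0, 0.5 + 0.9*1.0, 1.0] = [1.26, 1.4, 1.0]
```

Each R_t then serves as the reward signal for the sample built from turn t, so earlier turns get partial credit for later successes.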
It also adds a new parameter to `MultiTurnEnv` called `exclude_think`, which removes the chain-of-thought (CoT) from previous turns; a rough sketch of the idea follows below. I think this only makes sense when using the stepwise advantage implementation.
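A rough sketch of what stripping CoT from earlier turns might look like when rebuilding the context; the `<think>` tag convention, regex, and message structure here are assumptions about the chat format, not the parameter's actual implementation:

```python
import re

# Assumes CoT is delimited by <think>...</think> in assistant messages.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_cot(messages: list[dict]) -> list[dict]:
    """Drop <think> spans from prior assistant turns so later steps are
    trained without earlier chain-of-thought in their context."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```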
The implementation is based on *Kevin: Multi-Turn RL for Generating CUDA Kernels*.
Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Test improvement
Testing
- [x] All existing tests pass when running `uv run pytest` locally.
- [x] New tests have been added to cover the changes.
Checklist
- [x] My code follows the style guidelines of this project as outlined in AGENTS.md
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] Any dependent changes have been merged and published