verl
Qwen3 MOE model GRPO configs inconsistencies
System Info
Hi, I was looking at the GRPO scripts for the Qwen3 MoE models, particularly `examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh` and `examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh`. There seem to be some inconsistencies that I wanted to flag.
- The `use_kl_loss` flag is set to `False` in the `30B-A3B` script, even though the README explicitly states that for GRPO this should be set to `True`, which is also the case for the `235B-A22B` script. Similarly, the flag `kl_loss_coef` should be 0.001.
- The `max_response_length` in the `235B-A22B` config is set to `1204 * 8`, which is almost certainly wrong (presumably a typo for `1024 * 8`).
- In the bash script for the `30B-A3B` model, line 38 should be `TEST_FILE="['$aime24_test_path']"`.
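For concreteness, here is a sketch of what the corrected overrides might look like. This assumes the scripts pass these settings as Hydra-style command-line overrides, as the other verl GRPO example scripts do; the `1024 * 8` value is my inference about the intended response length, not something stated in the repo:

```shell
# Hedged sketch of proposed fixes, not a verbatim diff.

# run_qwen3moe-30b_megatron_96gb.sh: enable the KL loss as the GRPO README
# prescribes (assumed Hydra override keys, matching other GRPO examples):
#   actor_rollout_ref.actor.use_kl_loss=True \
#   actor_rollout_ref.actor.kl_loss_coef=0.001 \

# run_qwen3-235b_megatron_96gb.sh: 1204 * 8 is presumably a typo for 1024 * 8:
max_response_length=$((1024 * 8))
echo "$max_response_length"  # 8192, vs. 9632 from the current 1204 * 8
```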
I can open a PR to fix this if needed. Please let me know.
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Bug in config
Expected behavior
Bug in config