verl
Qwen3 MOE model GRPO configs inconsistencies
System Info
Hi, I was looking at the GRPO scripts for the Qwen3 MoE models, particularly `examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh` and `examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh`. There seem to be some inconsistencies that I wanted to flag.
- The `use_kl_loss` flag is set to `False` in the `30B-A3B` script, even though the README explicitly states that for GRPO this should be set to `True`, which is also the case for the `235B-A22B` script. Similarly, the flag `kl_loss_coef` should be 0.001.
- The `max_response_length` in the `235B-A22B` config is set to `1204 * 8`, which is almost certainly wrong (presumably a typo for `1024 * 8`).
- In the bash script for the `30B-A3B` model, line 38 should be `TEST_FILE="['$aime24_test_path']"`.
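For concreteness, here is a sketch of what the corrected overrides might look like. This assumes the scripts pass these settings as Hydra-style command-line overrides, as the other verl GRPO example scripts do; the `1024 * 8` value is my inference about the intended response length, not something stated in the repo:

```shell
# Hedged sketch of proposed fixes, not a verbatim diff.

# run_qwen3moe-30b_megatron_96gb.sh: enable the KL loss as the GRPO README
# prescribes (assumed Hydra override keys, matching other GRPO examples):
#   actor_rollout_ref.actor.use_kl_loss=True \
#   actor_rollout_ref.actor.kl_loss_coef=0.001 \

# run_qwen3-235b_megatron_96gb.sh: 1204 * 8 is presumably a typo for 1024 * 8:
max_response_length=$((1024 * 8))
echo "$max_response_length"  # 8192, vs. 9632 from the current 1204 * 8
```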
I can open a PR to fix this if needed. Please let me know.
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Bug in config
Expected behavior
Bug in config