kang sheng
Thanks, I already created the pull request and tested on my machine.
You can use mbridge to save checkpoints in the transformers format. You can refer to this script: https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh#L96.
It's related to mbridge. Maybe @ISEEKYAN can help.
cc @ETOgaosion
Can you test the recommended script? https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh
The script can also run on 4 H100 nodes, provided each node has sufficient CPU memory (>1.5 TB). We look forward to your feedback.
The `actor_rollout_ref.rollout.gpu_memory_utilization` value in your script is too high. Please set it to a lower value and test again. Maybe 0.7?
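For reference, a sketch of how the override could look on the training command line, assuming the script launches the trainer with Hydra-style key=value overrides as the other verl example scripts do (the entry point and other flags here are placeholders, not the exact contents of your script):

```shell
# Lower the fraction of GPU memory the rollout engine may reserve,
# leaving more headroom for the actor/ref model weights and activations.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    ...  # keep the rest of your existing overrides unchanged
```

If 0.7 still OOMs, stepping down further (e.g. 0.6) trades rollout KV-cache capacity for extra free memory.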
@ISEEKYAN @ETOgaosion Hello, do you have any ideas about this error?
@XQZZK Can you share with us which specific method you used?