kang sheng

19 comments by kang sheng

Thanks, I have already created the pull request and tested it on my machine.

You can use mbridge to save checkpoints in the Hugging Face transformers format. See this script for reference: https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh#L96.

Can you test the recommended script? https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh

The script can also run on 4 H100 nodes, provided each node has sufficient CPU memory (>1.5 TB). We look forward to your feedback.

The `actor_rollout_ref.rollout.gpu_memory_utilization` value in your script is too high. Please set it to a lower value, for example 0.7, and test again.
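
As a rough sketch of the change suggested above, the override could be built like this. It assumes the launch script passes options as Hydra-style `key=value` arguments; the variable names `GPU_MEMORY_UTILIZATION` and `ROLLOUT_OVERRIDE` are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical variable names; only the config key and the 0.7 value come from the comment.
# 0.7 is a starting point, not a tuned value: lower it further if out-of-memory errors persist.
GPU_MEMORY_UTILIZATION=0.7

# Build the key=value override string for the rollout engine's GPU memory fraction.
ROLLOUT_OVERRIDE="actor_rollout_ref.rollout.gpu_memory_utilization=${GPU_MEMORY_UTILIZATION}"

# Print the override so it can be checked before it is added to the launch command.
echo "${ROLLOUT_OVERRIDE}"
```

The resulting string would replace the existing `actor_rollout_ref.rollout.gpu_memory_utilization=...` argument in the training command of your script.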

@ISEEKYAN @ETOgaosion Hello, do you have any ideas about this error?

@XQZZK Can you share which specific method you used?