
Failure to save Megatron distributed checkpoints when using Megatron as a backend and a distributed checkpoint as the actor model

Open HillDing opened this issue 5 months ago • 10 comments

When I load a Qwen3-235B model for RL training from a Megatron distributed checkpoint, saving distributed checkpoints fails after several training steps. The saving process itself completes without reporting any error, but the final result looks like this:

(screenshot) The directory contains only a common.pt file; the other expected .distcp files are missing.

The RL training parameters are as follows:

(screenshot of training parameters)

Actually I have two questions: (1) How can I save the Megatron distributed checkpoints? (2) Can I save the model directly in the transformers (HF) format when using a distributed checkpoint as the actor model? If that's possible, what should I do?

HillDing avatar Sep 01 '25 12:09 HillDing

You can use mbridge to save a transformers-format checkpoint. You can refer to this script: https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh#L96.
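As a rough sketch of what enabling mbridge in a Megatron launch command looks like (the exact flag name and config path are assumptions based on verl's example scripts and may differ across verl versions; verify against the script linked above):

```shell
# Hedged sketch, not a complete command: the use_mbridge flag location
# below is assumed from verl's Megatron example scripts.
python3 -m verl.trainer.main_ppo \
    --config-name='ppo_megatron_trainer.yaml' \
    actor_rollout_ref.model.path=Qwen/Qwen3-235B-A22B \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    trainer.save_freq=20
# ...plus the remaining GRPO/Megatron settings from the example script.
```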

techkang avatar Sep 02 '25 01:09 techkang

You can use mbridge to save a transformers-format checkpoint. You can refer to this script: https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh#L96.

Thanks for the recommendation. I tried using mbridge to save the transformers-format checkpoint: training succeeded, but saving failed without any error being reported. The log shows the save completed successfully, yet the save path contains no checkpoint files.

(screenshots: save log reporting success, and the empty save path)

How can I solve this problem?

HillDing avatar Sep 03 '25 09:09 HillDing

It's related to mbridge. Maybe @ISEEKYAN can help.

techkang avatar Sep 03 '25 09:09 techkang

cc @ETOgaosion

techkang avatar Sep 04 '25 02:09 techkang

@HillDing Which versions of verl and Megatron are you using? We recently fixed the logic for saving checkpoints with mbridge; could you try the new megatron_checkpoint_manager? After enabling use_mbridge, the save directory contains:

  • model weights: huggingface
  • optimizer states and others: dist_ckpt
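Given that layout, a quick post-save sanity check can confirm both parts actually landed on disk. This is a minimal sketch: the `huggingface` and `dist_ckpt` subdirectory names follow the comment above, but adjust them to your actual save path and verl version.

```python
import os

def inspect_checkpoint(ckpt_dir):
    """Report which parts of a saved step are present on disk.

    Assumes the layout described above: HF-format model weights under
    huggingface/ and optimizer/other states under dist_ckpt/.
    """
    hf_dir = os.path.join(ckpt_dir, "huggingface")
    dist_dir = os.path.join(ckpt_dir, "dist_ckpt")
    return {
        # HF weights are typically .safetensors (or legacy .bin) shards.
        "hf_weight_files": sorted(
            f for f in os.listdir(hf_dir)
            if f.endswith((".safetensors", ".bin"))
        ) if os.path.isdir(hf_dir) else [],
        # Megatron distributed-checkpoint state is stored as .distcp files.
        "distcp_files": sorted(
            f for f in os.listdir(dist_dir) if f.endswith(".distcp")
        ) if os.path.isdir(dist_dir) else [],
    }
```

An empty list in either entry means that part of the checkpoint was silently skipped, which is the failure mode reported in this thread.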

ETOgaosion avatar Sep 04 '25 02:09 ETOgaosion

@HillDing Which versions of verl and Megatron are you using? We recently fixed the logic for saving checkpoints with mbridge; could you try the new megatron_checkpoint_manager? After enabling use_mbridge, the save directory contains:

  • model weights: huggingface
  • optimizer states and others: dist_ckpt

Are the model weights under the huggingface path the weights of the fine-tuned model?

XQZZK avatar Sep 10 '25 12:09 XQZZK

I'm running into the same problem.

dtl123456 avatar Oct 16 '25 07:10 dtl123456

@HillDing Which versions of verl and Megatron are you using? We recently fixed the logic for saving checkpoints with mbridge; could you try the new megatron_checkpoint_manager? After enabling use_mbridge, the save directory contains:

  • model weights: huggingface
  • optimizer states and others: dist_ckpt

Thanks, I tried use_mbridge and saved the checkpoint successfully! The model weights are saved under huggingface; I had mistakenly been checking dist_ckpt before.

HillDing avatar Oct 17 '25 09:10 HillDing

Hello, when I use mbridge to save the weights of a trained qwen3-vl4b, it seems only part of the output is written:

(screenshot) the 00002-000002 shard is missing

jiangsongtao avatar Nov 05 '25 12:11 jiangsongtao

@jiangsongtao Please try mbridge's unit test on qwen3vl4b ( https://github.com/ISEEKYAN/mbridge/blob/main/example/1.load_model_and_export_single_gpu.py ) to see whether it exports completely. If it doesn't, please open an issue against mbridge.
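For a missing-shard report like the one above, one way to confirm the export is incomplete is to compare the shard files listed in model.safetensors.index.json against what is actually on disk. This is a generic check of the Hugging Face sharded-checkpoint format, not anything mbridge-specific:

```python
import json
import os

def missing_shards(hf_dir):
    """Return shard filenames referenced by model.safetensors.index.json
    that are absent from the directory. An empty list means the export is
    complete (or the model is single-file and has no index)."""
    index_path = os.path.join(hf_dir, "model.safetensors.index.json")
    if not os.path.exists(index_path):
        return []
    with open(index_path) as f:
        index = json.load(f)
    # weight_map maps each tensor name to the shard file that stores it.
    expected = set(index["weight_map"].values())
    present = set(os.listdir(hf_dir))
    return sorted(expected - present)
```

If this reports a missing model-00002-of-00002.safetensors, the saver exited before flushing all shards, and the checkpoint cannot be loaded.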

ISEEKYAN avatar Nov 06 '25 15:11 ISEEKYAN