
Failure to save Megatron distributed checkpoints when using Megatron as a backend and a distributed checkpoint as the actor model

Open HillDing opened this issue 5 months ago • 10 comments

When I load a Qwen3-235B model for RL training from a Megatron distributed checkpoint, saving distributed checkpoints fails after several training steps. The saving process itself completes without reporting any error, but the final result looks like this:

(screenshot) The directory contains only a common.pt file; the other expected .distcp files are missing.

The RL training parameters are as follows:

(screenshot of training parameters)

Actually I have two questions: (1) How can I save the Megatron distributed checkpoints? (2) Can I save the model directly in the transformers (HF) format when using a distributed checkpoint as the actor model? If that's possible, what should I do?

HillDing avatar Sep 01 '25 12:09 HillDing

You can use mbridge to save a transformers-format checkpoint. You can refer to this script: https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh#L96.
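As a rough sketch of what enabling mbridge in a Megatron launch command looks like (the exact flag name and config path are assumptions based on verl's example scripts and may differ across verl versions; verify against the script linked above):

```shell
# Hedged sketch, not a complete command: the use_mbridge flag location
# below is assumed from verl's Megatron example scripts.
python3 -m verl.trainer.main_ppo \
    --config-name='ppo_megatron_trainer.yaml' \
    actor_rollout_ref.model.path=Qwen/Qwen3-235B-A22B \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    trainer.save_freq=20
# ...plus the remaining GRPO/Megatron settings from the example script.
```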

techkang avatar Sep 02 '25 01:09 techkang

You can use mbridge to save a transformers-format checkpoint. You can refer to this script: https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh#L96.

Thanks for the recommendation. I tried using mbridge to save the transformers-format checkpoint: training succeeded, but saving failed without any error being reported. The log shows the save completed successfully, yet the save path contains no checkpoint files.

(screenshots: save log reporting success, and the empty save path)

How can I solve this problem?

HillDing avatar Sep 03 '25 09:09 HillDing

It's related to mbridge. Maybe @ISEEKYAN can help.

techkang avatar Sep 03 '25 09:09 techkang

cc @ETOgaosion

techkang avatar Sep 04 '25 02:09 techkang

@HillDing Which versions of verl and Megatron are you using? We recently fixed the logic for saving checkpoints with mbridge; could you try the new megatron_checkpoint_manager? After enabling use_mbridge, the save directory contains:

  • model weights: huggingface
  • optimizer states and others: dist_ckpt
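Given that layout, a quick post-save sanity check can confirm both parts actually landed on disk. This is a minimal sketch: the `huggingface` and `dist_ckpt` subdirectory names follow the comment above, but adjust them to your actual save path and verl version.

```python
import os

def inspect_checkpoint(ckpt_dir):
    """Report which parts of a saved step are present on disk.

    Assumes the layout described above: HF-format model weights under
    huggingface/ and optimizer/other states under dist_ckpt/.
    """
    hf_dir = os.path.join(ckpt_dir, "huggingface")
    dist_dir = os.path.join(ckpt_dir, "dist_ckpt")
    return {
        # HF weights are typically .safetensors (or legacy .bin) shards.
        "hf_weight_files": sorted(
            f for f in os.listdir(hf_dir)
            if f.endswith((".safetensors", ".bin"))
        ) if os.path.isdir(hf_dir) else [],
        # Megatron distributed-checkpoint state is stored as .distcp files.
        "distcp_files": sorted(
            f for f in os.listdir(dist_dir) if f.endswith(".distcp")
        ) if os.path.isdir(dist_dir) else [],
    }
```

An empty list in either entry means that part of the checkpoint was silently skipped, which is the failure mode reported in this thread.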

ETOgaosion avatar Sep 04 '25 02:09 ETOgaosion

@HillDing Which versions of verl and Megatron are you using? We recently fixed the logic for saving checkpoints with mbridge; could you try the new megatron_checkpoint_manager? After enabling use_mbridge, the save directory contains:

  • model weights: huggingface
  • optimizer states and others: dist_ckpt

Are the model weights under the huggingface path the weights of the fine-tuned model?

XQZZK avatar Sep 10 '25 12:09 XQZZK

I'm running into the same problem.

dtl123456 avatar Oct 16 '25 07:10 dtl123456

@HillDing Which versions of verl and Megatron are you using? We recently fixed the logic for saving checkpoints with mbridge; could you try the new megatron_checkpoint_manager? After enabling use_mbridge, the save directory contains:

  • model weights: huggingface
  • optimizer states and others: dist_ckpt

Thanks, I tried use_mbridge and saved the checkpoint successfully! The model weights are saved under huggingface; I had mistakenly been checking dist_ckpt before.

HillDing avatar Oct 17 '25 09:10 HillDing

Hello, when I use mbridge to save the weights of a trained qwen3-vl4b, it seems only part of the output is written:

(screenshot) the 00002-000002 shard is missing

jiangsongtao avatar Nov 05 '25 12:11 jiangsongtao

@jiangsongtao Please try mbridge's unit test on qwen3vl4b ( https://github.com/ISEEKYAN/mbridge/blob/main/example/1.load_model_and_export_single_gpu.py ) to see whether it exports completely. If it doesn't, please open an issue against mbridge.
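For a missing-shard report like the one above, one way to confirm the export is incomplete is to compare the shard files listed in model.safetensors.index.json against what is actually on disk. This is a generic check of the Hugging Face sharded-checkpoint format, not anything mbridge-specific:

```python
import json
import os

def missing_shards(hf_dir):
    """Return shard filenames referenced by model.safetensors.index.json
    that are absent from the directory. An empty list means the export is
    complete (or the model is single-file and has no index)."""
    index_path = os.path.join(hf_dir, "model.safetensors.index.json")
    if not os.path.exists(index_path):
        return []
    with open(index_path) as f:
        index = json.load(f)
    # weight_map maps each tensor name to the shard file that stores it.
    expected = set(index["weight_map"].values())
    present = set(os.listdir(hf_dir))
    return sorted(expected - present)
```

If this reports a missing model-00002-of-00002.safetensors, the saver exited before flushing all shards, and the checkpoint cannot be loaded.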

ISEEKYAN avatar Nov 06 '25 15:11 ISEEKYAN