Failure to save Megatron distributed checkpoints when using Megatron as the backend and a distributed checkpoint as the actor model
When I load a Qwen3-235B model for RL training from a Megatron distributed checkpoint, saving distributed checkpoints fails after several training steps. However, the save process appears to complete without reporting any error; the final save status is shown below:
The RL training parameters are as follows:
Actually I have two questions: (1) How can I save the Megatron distributed checkpoints? (2) Can I save the model directly in the Transformers (HF) format when using a distributed checkpoint as the actor model? If that is possible, what should I do?
You can use mbridge to save a Transformers-format checkpoint. You can refer to this script: https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh#L96.
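For reference, a minimal sketch of what enabling mbridge in a verl Megatron launch might look like, using the Hydra-style overrides the verl example scripts use. The exact config keys (in particular the location of the `use_mbridge` flag) are assumptions here, so verify them against line 96 of the linked script.

```shell
# Hedged sketch: enabling mbridge in a verl Megatron run so model weights
# are also exported in Transformers (HF) format. Key paths are assumed;
# check the linked run_qwen3-235b_megatron_96gb.sh for the authoritative form.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.strategy=megatron \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    trainer.save_freq=20 \
    trainer.default_local_dir=./checkpoints
```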
Thanks for your recommendation. I tried using mbridge to save the Transformers format; the training process was successful, but the save process failed without any error being reported: the log shows the save succeeding, yet the save path contains no checkpoint files.
How can I solve this problem?
It's related to mbridge. Maybe @ISEEKYAN can help.
cc @ETOgaosion
@HillDing Which versions of verl and Megatron are you using? We recently fixed the logic for saving checkpoints with mbridge; could you try the new megatron_checkpoint_manager? Actually, after enabling use_mbridge, the saving directory layout is:
- model weights: huggingface
- optimizer states and others: dist_ckpt
@HillDing Which versions of verl and Megatron are you using? We recently fixed the logic for saving checkpoints with mbridge; could you try the new megatron_checkpoint_manager? Actually, after enabling use_mbridge, the saving directory layout is:
- model weights: huggingface
- optimizer states and others: dist_ckpt
Are the model weights under the huggingface path the weights of the fine-tuned model?
I ran into the same problem.
@HillDing Which versions of verl and Megatron are you using? We recently fixed the logic for saving checkpoints with mbridge; could you try the new megatron_checkpoint_manager? Actually, after enabling use_mbridge, the saving directory layout is:
- model weights: huggingface
- optimizer states and others: dist_ckpt
Thanks, I tried use_mbridge and saved the checkpoint successfully! The model weights are saved under huggingface; I had mistakenly been checking dist_ckpt before.
Hi, when I use mbridge to save the weights of qwen3-vl4b after training, the output seems to only contain
@jiangsongtao Please try mbridge's unit test on qwen3vl4b: https://github.com/ISEEKYAN/mbridge/blob/main/example/1.load_model_and_export_single_gpu.py to see whether it can export the full model; if it cannot, you can open an issue on mbridge.