verl
verl's Megatron core_r0.11.0 backend successfully tested with 3D parallelism, with multiple bugs fixed
This PR combines several changes.
Qwen2.5 checkpoint saver bug fix
Thanks to @uygnef for the work contributed in #368. We use the new saver in the model loader and saver to support 3D parallelism.
Megatron backend 3D-parallelism test benches
We modified the scripts in examples/ppo_trainer and tests/e2e, as well as the CI workflows; all of them have been tested.
Bug fixes for 3D parallelism
These include fixes for configuration bugs and for module packing.
The original tensor-parallel (TP) VocabParallelEntropy could lead to CUDA OOM, so we refactored the implementation with torch.bmm.
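To illustrate the idea behind the torch.bmm refactor, here is a minimal single-device sketch of computing per-token entropy from logits. It is not the code in this PR: the function name entropy_from_logits is hypothetical, and the tensor-parallel parts are omitted (with the vocabulary sharded across TP ranks, the row-wise max, the sum of exponentials, and the dot product would each need a reduction across ranks). It uses the identity H = log(sum_v exp(z_v)) - sum_v p_v * z_v and computes the second term as a batched (1 x V) @ (V x 1) product, which avoids materializing the elementwise product p * z as a separate tensor.

```python
import torch


def entropy_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of softmax(logits) over the last (vocab) dimension.

    Hypothetical helper for illustration only; single device, no TP reductions.
    """
    vocab = logits.shape[-1]
    flat = logits.reshape(-1, vocab).float()

    # Subtract the row-wise max for numerical stability (entropy is shift-invariant).
    shifted = flat - flat.max(dim=-1, keepdim=True).values
    exp = shifted.exp()
    sum_exp = exp.sum(dim=-1, keepdim=True)
    probs = exp / sum_exp

    # sum_v p_v * z_v as a batched dot product: (N, 1, V) @ (N, V, 1) -> (N, 1, 1)
    expected = torch.bmm(probs.unsqueeze(1), shifted.unsqueeze(2)).reshape(-1)

    # H = log(sum_exp) - E_p[z]
    entropy = sum_exp.squeeze(-1).log() - expected
    return entropy.reshape(logits.shape[:-1])
```

For example, calling entropy_from_logits on a tensor of shape (batch, seq_len, vocab) returns a tensor of shape (batch, seq_len).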
Full migration to Megatron Core
verl now uses only Megatron Core and no longer calls other components. If any of them are needed, please integrate them into utils/megatron.