
verl's Megatron core_r0.11.0 backend successfully tested with 3D parallelism, with multiple bugs fixed

Open ETOgaosion opened this issue 11 months ago • 0 comments

This PR combines multiple modifications.

Qwen2.5 checkpoint saver bug fix

Thanks to @uygnef for the work contributed in #368. We use the new saver for both the model loader and the model saver to support 3D parallelism.

Megatron backend 3D-parallelism test benches

We modified the scripts in examples/ppo_trainer and tests/e2e, as well as the CI workflows; all of them have been tested.

Bug Fix for 3D-parallelism

The fixes cover configuration bugs as well as module packing.

The original TP VocabParallelEntropy could lead to CUDA OOM, so we refactored the implementation with torch.bmm.
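
For readers unfamiliar with vocab-parallel entropy, here is a minimal sketch of the underlying math. It is not the code in this PR. It assumes the logits of one token are split along the vocabulary dimension across TP ranks, and it simulates the all-reduce steps with plain Python `max` and `sum`. The function names `shard_stats`, `vocab_parallel_entropy`, and `reference_entropy` are hypothetical. The entropy is computed as `logsumexp(z) - E_p[z]`, so each rank only contributes two scalars per token. In a tensor implementation, the weighted sum of logits is a dot product, which is the kind of reduction a batched matmul such as torch.bmm can perform without materializing a full-size elementwise product.

```python
import math


def shard_stats(shard, global_max):
    """Per-rank partial sums over one vocabulary shard of the logits."""
    exps = [math.exp(z - global_max) for z in shard]
    sum_exp = sum(exps)
    # Dot product of the exp-weights with the logits (assumption: this is the
    # reduction a batched matmul would carry out in a tensor implementation).
    sum_exp_logits = sum(e * z for e, z in zip(exps, shard))
    return sum_exp, sum_exp_logits


def vocab_parallel_entropy(shards):
    """Entropy of softmax(logits), with the logits split across `shards`."""
    # Simulated all-reduce(MAX) across ranks, for numerical stability.
    global_max = max(max(shard) for shard in shards)
    partials = [shard_stats(shard, global_max) for shard in shards]
    # Simulated all-reduce(SUM) across ranks.
    sum_exp = sum(p[0] for p in partials)
    sum_exp_logits = sum(p[1] for p in partials)
    # H = logsumexp(z) - E_p[z]
    return math.log(sum_exp) + global_max - sum_exp_logits / sum_exp


def reference_entropy(logits):
    """Unsharded reference: H = -sum(p * log p)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Calling `vocab_parallel_entropy([[1.0, 2.0], [3.0, 4.0]])` should agree with `reference_entropy([1.0, 2.0, 3.0, 4.0])` up to floating-point error, which shows that sharding the vocabulary does not change the result.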

Full migration to Megatron Core

verl now uses only Megatron Core and no longer calls any other Megatron components. If such components are needed, please integrate them into utils/megatron.

ETOgaosion, Mar 06 '25 07:03