
OOM with the Megatron backend when running PPO on the Qwen2.5-32B model across 4 H800 nodes

Open • echo-valor opened this issue 11 months ago • 1 comment

  • The error is as follows:

[Image: screenshot of the error]

  • The training parameters are set as follows:

    python3 -m verl.trainer.main_ppo \
        data.train_files=$HOME/train.parquet \
        data.val_files=$HOME/test.parquet \
        data.train_batch_size=32 \
        data.max_prompt_length=1024 \
        data.max_response_length=1024 \
        actor_rollout_ref.model.path=/private/online_llf/model/Qwen2.5-32B-Instruct \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.actor.optim.lr=1e-6 \
        actor_rollout_ref.actor.ppo_mini_batch_size=2 \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
        actor_rollout_ref.model.enable_gradient_checkpointing=True \
        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
        actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
        actor_rollout_ref.rollout.name=vllm \
        actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
        critic.optim.lr=1e-5 \
        critic.model.path=/private/online_llf/model/Qwen2.5-32B-Instruct \
        critic.model.enable_gradient_checkpointing=True \
        critic.ppo_micro_batch_size_per_gpu=4 \
        algorithm.kl_ctrl.kl_coef=0.0001 \
        trainer.critic_warmup=0 \
        trainer.logger=['console'] \
        trainer.project_name='verl_example' \
        trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
        trainer.n_gpus_per_node=8 \
        trainer.nnodes=4 \
        trainer.default_local_dir=$output_dir \
        trainer.save_freq=5 \
        trainer.test_freq=5 \
        trainer.total_epochs=15 $@

echo-valor • Feb 26 '25 03:02
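A rough sanity check on the rollout settings in the command above may help narrow down where the memory goes. The sketch below is only back-of-the-envelope arithmetic, not a measurement: it assumes about 32 billion parameters (taken from the model name), bf16 weights, 80 GB of memory per H800, and that `gpu_memory_utilization` is a fraction of total device memory (verl's handling of this value may differ by version). The function names are hypothetical.

```python
# Back-of-the-envelope rollout memory check. Every constant is an assumption.
ROLLOUT_PARAMS = 32e9        # approx. parameter count implied by the model name
ROLLOUT_BYTES_PER_PARAM = 2  # bf16 weights
H800_MEM_GB = 80             # device memory per H800
BYTES_PER_GB = 1e9


def rollout_weight_gb(tp_size: int) -> float:
    """Weight memory per GPU when the model is split across tp_size GPUs."""
    return ROLLOUT_PARAMS * ROLLOUT_BYTES_PER_PARAM / tp_size / BYTES_PER_GB


def rollout_budget_gb(gpu_memory_utilization: float) -> float:
    """Memory per GPU that the rollout engine is allowed to use."""
    return H800_MEM_GB * gpu_memory_utilization


if __name__ == "__main__":
    budget = rollout_budget_gb(0.3)
    for tp in (2, 4, 8):
        weights = rollout_weight_gb(tp)
        verdict = "fits" if weights < budget else "does not fit"
        print(f"rollout TP={tp}: weights {weights:.1f} GB vs budget {budget:.1f} GB -> {verdict}")
```

Under these assumptions, the weights alone at `tensor_model_parallel_size=2` exceed the budget implied by `gpu_memory_utilization=0.3`, before any KV cache is allocated.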

Could you try increasing the tensor model parallel / pipeline model parallel size in verl/trainer/config/ppo_megatron_trainer.yaml?

eric-haibin-lin • Feb 26 '25 22:02
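To see why a larger tensor or pipeline parallel size helps, here is a minimal sketch of the per-GPU training state for one 32B model. It rests on assumptions not stated in the thread: about 32 billion parameters, the common rule of thumb of 16 bytes per parameter for mixed-precision Adam (bf16 weights and gradients, fp32 master weights, two fp32 Adam moments), no sharding of optimizer state across data-parallel ranks, and 80 GB per H800. Activations, the critic, the reference model, and the rollout engine are ignored, so real usage is higher. The function names are hypothetical.

```python
# Rough per-GPU training-state estimate. Every constant is an assumption.
TRAIN_PARAMS = 32e9        # approx. parameter count implied by the model name
TRAIN_BYTES_PER_PARAM = 16 # bf16 weights (2) + bf16 grads (2) + fp32 master (4) + Adam m, v (8)
GPU_MEM_GB = 80            # device memory per H800


def train_state_gb(tp_size: int, pp_size: int) -> float:
    """Weights, gradients and optimizer state per GPU for one model."""
    return TRAIN_PARAMS * TRAIN_BYTES_PER_PARAM / (tp_size * pp_size) / 1e9


def fits(tp_size: int, pp_size: int, reserved_gb: float = 0.0) -> bool:
    """True if the training state plus reserved memory fits on one GPU."""
    return train_state_gb(tp_size, pp_size) + reserved_gb <= GPU_MEM_GB


if __name__ == "__main__":
    for tp, pp in [(2, 1), (4, 1), (8, 1), (8, 2), (8, 4)]:
        state = train_state_gb(tp, pp)
        print(f"TP={tp} PP={pp}: {state:.0f} GB per GPU, fits in {GPU_MEM_GB} GB: {fits(tp, pp)}")
```

With these numbers, a model-parallel size (TP × PP) of 2 or 4 leaves the training state well above 80 GB per GPU, while 8 or more brings it under the limit, with headroom growing as the product increases.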