Using the Megatron backend, OOM occurs when running PPO on the Qwen2.5-32B model on 4 H800 nodes
- The error is as follows:
- The training parameters are set as follows:
```bash
python3 -m verl.trainer.main_ppo \
    data.train_files=$HOME/train.parquet \
    data.val_files=$HOME/test.parquet \
    data.train_batch_size=32 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=/private/online_llf/model/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=2 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    critic.optim.lr=1e-5 \
    critic.model.path=/private/online_llf/model/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=True \
    critic.ppo_micro_batch_size_per_gpu=4 \
    algorithm.kl_ctrl.kl_coef=0.0001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console'] \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.default_local_dir=$output_dir \
    trainer.save_freq=5 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 $@
```
Could you try increasing the tensor model parallel size / pipeline model parallel size in verl/trainer/config/ppo_megatron_trainer.yaml?
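For reference, a minimal sketch of what such overrides might look like on the command line, assuming the Megatron parallelism settings are exposed under `actor_rollout_ref.actor.megatron.*` and `critic.megatron.*` as in `ppo_megatron_trainer.yaml` (the key names and config name below are assumptions; verify them against the yaml in your checkout, or edit the yaml directly):

```bash
# Sketch only: raise the Megatron tensor/pipeline parallel sizes so the 32B
# weights, gradients, and optimizer states are sharded across more GPUs.
# Key names are assumptions; check ppo_megatron_trainer.yaml for the exact ones.
python3 -m verl.trainer.main_ppo --config-name=ppo_megatron_trainer \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=8 \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2 \
    critic.megatron.tensor_model_parallel_size=8 \
    critic.megatron.pipeline_model_parallel_size=2 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=8
    # ...plus the same data/model/trainer arguments as in the original command
```

With, say, TP=8 and PP=2, each model replica would be sharded over 16 of the 32 available GPUs instead of being nearly replicated per GPU, which should substantially reduce per-GPU memory pressure compared to the current rollout-only `tensor_model_parallel_size=2`.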