
How to set parameters to avoid OOM?

asirgogogo opened this issue 11 months ago • 7 comments (Open)

When rollout.n is set to a larger value (e.g. 64 or 128) for a 7B model on 4 GPUs, how should I configure the parameters so training actually runs? I tried many combinations, but they all ended in OOM. The full set of overrides I am passing is below (a scaled-down variant is sketched after it).

data.train_batch_size=4
data.val_batch_size=4
data.max_prompt_length=400
data.max_response_length=2048
actor_rollout_ref.model.path=$MODEL_PATH
actor_rollout_ref.actor.optim.lr=3e-7
actor_rollout_ref.model.use_remove_padding=True
actor_rollout_ref.actor.ppo_mini_batch_size=256
actor_rollout_ref.actor.ppo_micro_batch_size=64
actor_rollout_ref.actor.use_kl_loss=True
actor_rollout_ref.actor.kl_loss_coef=0.001
actor_rollout_ref.actor.kl_loss_type=low_var_kl
actor_rollout_ref.model.enable_gradient_checkpointing=True
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.grad_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
actor_rollout_ref.rollout.log_prob_micro_batch_size=160
actor_rollout_ref.rollout.tensor_model_parallel_size=1
actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.gpu_memory_utilization=0.6
actor_rollout_ref.rollout.n=64
actor_rollout_ref.ref.log_prob_micro_batch_size=160
actor_rollout_ref.ref.fsdp_config.param_offload=True
algorithm.kl_ctrl.kl_coef=0.001
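
For reference, here is a minimal sketch of a more memory-conservative set of overrides for the same trainer command, reusing only the parameter names above. The values are illustrative, not a verified recipe: the exact divisibility rules (micro batch vs. number of GPUs, mini batch vs. train_batch_size * rollout.n) depend on the verl version, and tensor_model_parallel_size=2 assumes the rollout engine can shard the 7B model across pairs of the 4 GPUs.

data.train_batch_size=4
data.max_prompt_length=400
data.max_response_length=1024
actor_rollout_ref.actor.ppo_mini_batch_size=64
actor_rollout_ref.actor.ppo_micro_batch_size=8
actor_rollout_ref.model.enable_gradient_checkpointing=True
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.grad_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
actor_rollout_ref.rollout.n=16
actor_rollout_ref.rollout.tensor_model_parallel_size=2
actor_rollout_ref.rollout.gpu_memory_utilization=0.4
actor_rollout_ref.rollout.log_prob_micro_batch_size=8
actor_rollout_ref.ref.log_prob_micro_batch_size=8
actor_rollout_ref.ref.fsdp_config.param_offload=True

With train_batch_size=4 and rollout.n=16 this yields 64 trajectories per step, so ppo_mini_batch_size=64 gives one PPO mini-batch and ppo_micro_batch_size=8 splits it into 8 forward/backward chunks.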

asirgogogo commented on Feb 20, 2025

@asirgogogo did you try setting a smaller actor_rollout_ref.actor.ppo_micro_batch_size?

uygnef commented on Feb 20, 2025

Sure, I set actor_rollout_ref.actor.ppo_micro_batch_size=2 and it still OOMs!

I have no idea where to start debugging this; how can I get training to run?

asirgogogo commented on Feb 20, 2025

Some tips (might be helpful); a rough sketch of applying them follows the list.

  1. Decrease actor_rollout_ref.rollout.n
  2. Ensure the environment variable is set: export VLLM_ATTENTION_BACKEND=XFORMERS
  3. Decrease actor_rollout_ref.actor.ppo_micro_batch_size
  4. Decrease actor_rollout_ref.rollout.log_prob_micro_batch_size and actor_rollout_ref.ref.log_prob_micro_batch_size
  5. Decrease data.max_response_length
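
To make tip 2 concrete, and to see why tips 1 and 5 help, here is a back-of-the-envelope sketch using the numbers from the command above (the token counts are rough estimates, not measurements):

# Set before launching the trainer so vLLM uses the xformers attention backend:
export VLLM_ATTENTION_BACKEND=XFORMERS

# Rough generation workload with the original settings:
#   train_batch_size (4) * rollout.n (64) = 256 sequences per step,
#   each up to max_prompt_length (400) + max_response_length (2048) = 2448 tokens,
#   i.e. on the order of 600k tokens of KV cache and log-prob recomputation per step.
# Halving rollout.n or max_response_length roughly halves that load.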

TissueC commented on Feb 20, 2025

CUDA OOM or CPU memory OOM?

zui-jiang commented on Feb 21, 2025

CUDA OOM or CPU memory OOM?

I always hit CPU memory OOM.

TTtianTT commented on Mar 09, 2025

Some tips (might be helpful)

  1. Decrease actor_rollout_ref.rollout.n
  2. Ensure the setting export VLLM_ATTENTION_BACKEND=XFORMERS
  3. Decrease actor_rollout_ref.actor.ppo_micro_batch_size
  4. Decrease actor_rollout_ref.rollout.log_prob_micro_batch_size and actor_rollout_ref.ref.log_prob_micro_batch_size
  5. Decrease data.max_response_length

Could you elaborate a bit more on how these parameters affect model performance?

kirklandWater1 commented on Mar 15, 2025

Some tips (might be helpful)

  1. Decrease actor_rollout_ref.rollout.n
  2. Ensure the setting export VLLM_ATTENTION_BACKEND=XFORMERS
  3. Decrease actor_rollout_ref.actor.ppo_micro_batch_size
  4. Decrease actor_rollout_ref.rollout.log_prob_micro_batch_size and actor_rollout_ref.ref.log_prob_micro_batch_size
  5. Decrease data.max_response_length

Could you elaborate a bit more on how these parameters affect model performance?

Looking at the documentation for these configurations, and especially the visualization linked below, helped me understand how they work; a rough rule of thumb follows the link.

https://excalidraw.com/#json=pfhkRmiLm1jnnRli9VFhb,Ut4E8peALlgAUpr7E5pPCA
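
A rule of thumb that follows from that picture (my reading, not an official statement from the verl docs): the micro-batch settings only control how a fixed batch is chunked into forward/backward passes, so lowering them trades throughput for memory without changing the optimization result, whereas train_batch_size, ppo_mini_batch_size, rollout.n, and max_response_length change what the optimizer actually sees and therefore can affect model quality. Plugging in the numbers from the original command:

TRAIN_BATCH_SIZE=4; ROLLOUT_N=64
PPO_MINI_BATCH_SIZE=256; PPO_MICRO_BATCH_SIZE=64
TRAJECTORIES=$((TRAIN_BATCH_SIZE * ROLLOUT_N))               # 256 sampled responses per step
MINI_BATCHES=$((TRAJECTORIES / PPO_MINI_BATCH_SIZE))         # 1 PPO mini-batch per step
GRAD_ACCUM=$((PPO_MINI_BATCH_SIZE / PPO_MICRO_BATCH_SIZE))   # 4 gradient-accumulation chunks
echo "$TRAJECTORIES trajectories, $MINI_BATCHES mini-batch(es), $GRAD_ACCUM accumulation chunks per mini-batch"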

HorHang commented on May 29, 2025