
How to set parameters to avoid OOM?

asirgogogo opened this issue 11 months ago • 7 comments (Open)

When rollout.n is set to a larger value (e.g. 64 or 128) for a 7B model on 4 GPUs, how should I configure the parameters so training actually runs? I tried many combinations, but they all ended in OOM. The full set of overrides I am passing is below (a scaled-down variant is sketched after it).

data.train_batch_size=4
data.val_batch_size=4
data.max_prompt_length=400
data.max_response_length=2048
actor_rollout_ref.model.path=$MODEL_PATH
actor_rollout_ref.actor.optim.lr=3e-7
actor_rollout_ref.model.use_remove_padding=True
actor_rollout_ref.actor.ppo_mini_batch_size=256
actor_rollout_ref.actor.ppo_micro_batch_size=64
actor_rollout_ref.actor.use_kl_loss=True
actor_rollout_ref.actor.kl_loss_coef=0.001
actor_rollout_ref.actor.kl_loss_type=low_var_kl
actor_rollout_ref.model.enable_gradient_checkpointing=True
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.grad_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
actor_rollout_ref.rollout.log_prob_micro_batch_size=160
actor_rollout_ref.rollout.tensor_model_parallel_size=1
actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.gpu_memory_utilization=0.6
actor_rollout_ref.rollout.n=64
actor_rollout_ref.ref.log_prob_micro_batch_size=160
actor_rollout_ref.ref.fsdp_config.param_offload=True
algorithm.kl_ctrl.kl_coef=0.001
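
For reference, here is a minimal sketch of a more memory-conservative set of overrides for the same trainer command, reusing only the parameter names above. The values are illustrative, not a verified recipe: the exact divisibility rules (micro batch vs. number of GPUs, mini batch vs. train_batch_size * rollout.n) depend on the verl version, and tensor_model_parallel_size=2 assumes the rollout engine can shard the 7B model across pairs of the 4 GPUs.

data.train_batch_size=4
data.max_prompt_length=400
data.max_response_length=1024
actor_rollout_ref.actor.ppo_mini_batch_size=64
actor_rollout_ref.actor.ppo_micro_batch_size=8
actor_rollout_ref.model.enable_gradient_checkpointing=True
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.grad_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
actor_rollout_ref.rollout.n=16
actor_rollout_ref.rollout.tensor_model_parallel_size=2
actor_rollout_ref.rollout.gpu_memory_utilization=0.4
actor_rollout_ref.rollout.log_prob_micro_batch_size=8
actor_rollout_ref.ref.log_prob_micro_batch_size=8
actor_rollout_ref.ref.fsdp_config.param_offload=True

With train_batch_size=4 and rollout.n=16 this yields 64 trajectories per step, so ppo_mini_batch_size=64 gives one PPO mini-batch and ppo_micro_batch_size=8 splits it into 8 forward/backward chunks.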

asirgogogo commented on Feb 20, 2025

@asirgogogo did you try setting a smaller actor_rollout_ref.actor.ppo_micro_batch_size?

uygnef commented on Feb 20, 2025

Sure, I set actor_rollout_ref.actor.ppo_micro_batch_size=2 and it still OOMs!

I have no idea where to start debugging this; how can I get training to run?

asirgogogo commented on Feb 20, 2025

Some tips (might be helpful); a rough sketch of applying them follows the list.

  1. Decrease actor_rollout_ref.rollout.n
  2. Ensure the environment variable is set: export VLLM_ATTENTION_BACKEND=XFORMERS
  3. Decrease actor_rollout_ref.actor.ppo_micro_batch_size
  4. Decrease actor_rollout_ref.rollout.log_prob_micro_batch_size and actor_rollout_ref.ref.log_prob_micro_batch_size
  5. Decrease data.max_response_length
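
To make tip 2 concrete, and to see why tips 1 and 5 help, here is a back-of-the-envelope sketch using the numbers from the command above (the token counts are rough estimates, not measurements):

# Set before launching the trainer so vLLM uses the xformers attention backend:
export VLLM_ATTENTION_BACKEND=XFORMERS

# Rough generation workload with the original settings:
#   train_batch_size (4) * rollout.n (64) = 256 sequences per step,
#   each up to max_prompt_length (400) + max_response_length (2048) = 2448 tokens,
#   i.e. on the order of 600k tokens of KV cache and log-prob recomputation per step.
# Halving rollout.n or max_response_length roughly halves that load.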

TissueC commented on Feb 20, 2025

CUDA OOM or CPU memory OOM?

zui-jiang commented on Feb 21, 2025

CUDA OOM or CPU memory OOM?

I always hit CPU memory OOM.

TTtianTT commented on Mar 09, 2025

Some tips (might be helpful)

  1. Decrease actor_rollout_ref.rollout.n
  2. Ensure the setting export VLLM_ATTENTION_BACKEND=XFORMERS
  3. Decrease actor_rollout_ref.actor.ppo_micro_batch_size
  4. Decrease actor_rollout_ref.rollout.log_prob_micro_batch_size and actor_rollout_ref.ref.log_prob_micro_batch_size
  5. Decrease data.max_response_length

Could you elaborate a bit more on how these parameters affect model performance?

kirklandWater1 commented on Mar 15, 2025

Some tips (might be helpful)

  1. Decrease actor_rollout_ref.rollout.n
  2. Ensure the setting export VLLM_ATTENTION_BACKEND=XFORMERS
  3. Decrease actor_rollout_ref.actor.ppo_micro_batch_size
  4. Decrease actor_rollout_ref.rollout.log_prob_micro_batch_size and actor_rollout_ref.ref.log_prob_micro_batch_size
  5. Decrease data.max_response_length

Could you elaborate a bit more on how these parameters affect model performance?

Looking at the documentation for these configurations, and especially the visualization linked below, helped me understand how they work; a rough rule of thumb follows the link.

https://excalidraw.com/#json=pfhkRmiLm1jnnRli9VFhb,Ut4E8peALlgAUpr7E5pPCA
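
A rule of thumb that follows from that picture (my reading, not an official statement from the verl docs): the micro-batch settings only control how a fixed batch is chunked into forward/backward passes, so lowering them trades throughput for memory without changing the optimization result, whereas train_batch_size, ppo_mini_batch_size, rollout.n, and max_response_length change what the optimizer actually sees and therefore can affect model quality. Plugging in the numbers from the original command:

TRAIN_BATCH_SIZE=4; ROLLOUT_N=64
PPO_MINI_BATCH_SIZE=256; PPO_MICRO_BATCH_SIZE=64
TRAJECTORIES=$((TRAIN_BATCH_SIZE * ROLLOUT_N))               # 256 sampled responses per step
MINI_BATCHES=$((TRAJECTORIES / PPO_MINI_BATCH_SIZE))         # 1 PPO mini-batch per step
GRAD_ACCUM=$((PPO_MINI_BATCH_SIZE / PPO_MICRO_BATCH_SIZE))   # 4 gradient-accumulation chunks
echo "$TRAJECTORIES trajectories, $MINI_BATCHES mini-batch(es), $GRAD_ACCUM accumulation chunks per mini-batch"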

HorHang commented on May 29, 2025