How to set parameters to solve OOM!!!
When rollout.n is set to a large value (e.g. 64 or 128) with a 7B model on 4 GPUs, how should I configure the parameters so that training runs? I have tried many settings, but they all ended in OOM.
data.train_batch_size=4
data.val_batch_size=4
data.max_prompt_length=400
data.max_response_length=2048
actor_rollout_ref.model.path=$MODEL_PATH
actor_rollout_ref.actor.optim.lr=3e-7
actor_rollout_ref.model.use_remove_padding=True
actor_rollout_ref.actor.ppo_mini_batch_size=256
actor_rollout_ref.actor.ppo_micro_batch_size=64
actor_rollout_ref.actor.use_kl_loss=True
actor_rollout_ref.actor.kl_loss_coef=0.001
actor_rollout_ref.actor.kl_loss_type=low_var_kl
actor_rollout_ref.model.enable_gradient_checkpointing=True
actor_rollout_ref.actor.fsdp_config.param_offload=True
actor_rollout_ref.actor.fsdp_config.grad_offload=True
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
actor_rollout_ref.rollout.log_prob_micro_batch_size=160
actor_rollout_ref.rollout.tensor_model_parallel_size=1
actor_rollout_ref.rollout.name=vllm
actor_rollout_ref.rollout.gpu_memory_utilization=0.6
actor_rollout_ref.rollout.n=64
actor_rollout_ref.ref.log_prob_micro_batch_size=160
actor_rollout_ref.ref.fsdp_config.param_offload=True
algorithm.kl_ctrl.kl_coef=0.001
@asirgogogo did you try setting a smaller actor_rollout_ref.actor.ppo_micro_batch_size?
Sure, I set actor_rollout_ref.actor.ppo_micro_batch_size=2, and it still OOMs!
How can I train when I have no idea where to start?
Some tips (might be helpful)
- Decrease actor_rollout_ref.rollout.n
- Ensure export VLLM_ATTENTION_BACKEND=XFORMERS is set
- Decrease actor_rollout_ref.actor.ppo_micro_batch_size
- Decrease actor_rollout_ref.rollout.log_prob_micro_batch_size and actor_rollout_ref.ref.log_prob_micro_batch_size
- Decrease data.max_response_length
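To see why rollout.n and data.max_response_length are on this list, here is a rough back-of-the-envelope sketch of how many sequences and tokens one rollout step can produce. It assumes each prompt is sampled rollout.n times and that every sequence reaches the maximum prompt and response length, so it is an upper bound and not a measurement. The function name and the reduced values are made up for illustration.

```python
def rollout_token_bound(train_batch_size, rollout_n,
                        max_prompt_length, max_response_length):
    """Upper bound on (sequences, tokens) generated in one rollout step.

    Assumes every prompt is sampled rollout_n times and every sequence
    is padded/generated up to the configured maximum lengths.
    """
    sequences = train_batch_size * rollout_n
    tokens = sequences * (max_prompt_length + max_response_length)
    return sequences, tokens


# Values taken from the config posted at the top of the thread.
posted = rollout_token_bound(4, 64, 400, 2048)

# Hypothetical reduced settings, chosen only to illustrate the tips.
reduced = rollout_token_bound(4, 16, 400, 1024)

print(posted)   # (256, 626688)
print(reduced)  # (64, 91136)
```

Under these assumptions, cutting rollout.n from 64 to 16 and data.max_response_length from 2048 to 1024 shrinks the worst-case token count by roughly 7x. The micro batch sizes do not change this total; they only control how many of those sequences go through a single forward/backward pass at once.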
CUDA OOM or CPU memory OOM?
I always hit CPU memory OOM.
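For the CPU case, note that the posted config enables param_offload, grad_offload, and optimizer_offload for the actor and param_offload for the ref model, so those tensors sit in host RAM while offloaded. A rough estimate of that footprint is sketched below. It assumes fp32 parameters and gradients and an Adam-style optimizer with two fp32 moments per parameter; the actual dtypes depend on the setup, and the helper name is hypothetical. It ignores the vLLM engine, Ray, and data loading, so real usage will be higher.

```python
GB = 1e9  # decimal gigabytes


def offloaded_host_memory_gb(num_params, param_bytes=4, grad_bytes=4,
                             optim_bytes=8, ref_param_bytes=4):
    """Rough host-RAM estimate (in GB) for fully offloaded actor and ref.

    Byte sizes are assumptions: fp32 params and grads, plus two fp32
    optimizer moments per parameter (8 bytes). The totals are summed
    over all ranks on the machine.
    """
    actor = num_params * (param_bytes + grad_bytes + optim_bytes)
    ref = num_params * ref_param_bytes
    return actor / GB, ref / GB


# 7B parameters, matching the model size mentioned in the question.
actor_gb, ref_gb = offloaded_host_memory_gb(7e9)
print(f"actor ~{actor_gb:.0f} GB, ref ~{ref_gb:.0f} GB, "
      f"total ~{actor_gb + ref_gb:.0f} GB")
```

If the estimate is in the same range as the machine's physical RAM, turning off some of the offload flags (and paying for it in GPU memory) may be worth trying; whether that fits depends on the GPUs involved.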
Could you elaborate a little more on how these parameters would affect model performance?
Looking at the documentation for those configurations, and especially this visualization, helped me understand how they work.
https://excalidraw.com/#json=pfhkRmiLm1jnnRli9VFhb,Ut4E8peALlgAUpr7E5pPCA