If rollout.n is doubled, will the number of samples used for training also double? And does training incur higher GPU cost in the forward/backward passes because of the larger rollout.n?
Yes. Your understanding is correct.
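To make the scaling concrete, here is the arithmetic that answer implies, as a minimal sketch (plain Python with illustrative numbers, not verl API calls; I'm assuming `train_batch_size` counts prompts):

```python
# Illustrative arithmetic only, not verl internals.
train_batch_size = 1024   # prompts sampled per RL step (assumed semantics)
rollout_n = 8             # responses generated per prompt

# Every prompt yields rollout_n sequences, so the PPO update sees:
samples_per_step = train_batch_size * rollout_n   # 8192 sequences

# Doubling rollout_n doubles samples_per_step, so the total
# forward/backward work per RL step doubles too. With a fixed
# micro-batch size this shows up as more gradient-accumulation
# steps, not more memory per pass.
print(samples_per_step)                           # 8192
```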
Hi, thanks for your reply. So if I increase rollout.n by a factor of m, should I decrease ppo_mini_batch_size or ppo_micro_batch_size to 1/m of its original value?
I think ppo_mini_batch_size should instead be multiplied by n. ppo_micro_batch_size should not be changed, because it only affects gradient accumulation, and increasing ppo_micro_batch_size may cause OOM.
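If I understand the suggestion, the effect looks like this (a minimal sketch with made-up numbers; the variable names mirror the config keys, but this is plain arithmetic, not verl internals):

```python
# Why scaling ppo_mini_batch_size with n keeps the optimizer schedule
# fixed while leaving per-pass memory alone. Numbers are illustrative.
prompts_per_step = 512
rollout_n = 4
samples = prompts_per_step * rollout_n               # 2048 sequences

ppo_mini_batch_size = 256 * rollout_n                # scaled with n -> 1024
ppo_micro_batch_size = 16                            # unchanged: bounded by GPU memory

optimizer_updates = samples // ppo_mini_batch_size   # 2, same count as without rollout_n
grad_accum_steps = ppo_mini_batch_size // ppo_micro_batch_size  # 64, grows with n

print(optimizer_updates, grad_accum_steps)
```

So the extra samples from a larger rollout.n are absorbed as additional gradient-accumulation steps per mini-batch, while the number of optimizer updates per RL step and the peak memory per forward/backward stay the same.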
ppo_micro_batch_size affects gradient accumulation, so what is the role of train_batch_size?
same question
I have the same issue, and I want to describe it in more detail. Here are the settings for GRPO in the example code:
- `data.train_batch_size=1024`
- `actor_rollout_ref.actor.ppo_mini_batch_size=256`
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=80`
- `actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=160`
- `actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=160`
- `actor_rollout_ref.rollout.n=5`
My questions:

- Does `data.train_batch_size=1024` refer to #prompts, or to #prompts * `actor_rollout_ref.rollout.n`? In other words, is one batch `1024 * 5` samples or `1024`?
- Does `actor_rollout_ref.actor.ppo_mini_batch_size=256` refer to #prompts, or to #prompts * `actor_rollout_ref.rollout.n`?
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=80` does not divide `actor_rollout_ref.actor.ppo_mini_batch_size=256` evenly. Why is that? Is this a misconfiguration in the example code?
- Do the values `ppo_mini_batch_size=256` and `ppo_micro_batch_size_per_gpu=80` already take `actor_rollout_ref.rollout.n=5` into account? That is, how do I calculate my training steps: `1024 * 5 / 256` or `1024 / 256`? And how do I calculate gradient-accumulation steps: `256 / 80` or `256 * 5 / 80`? (See the sketch after this list.)
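To pin down the last question, here is the arithmetic under both readings (a sketch only; which reading verl actually uses is exactly what I am asking, and the 8-GPU figure is a guess to show one way 80 could divide evenly):

```python
# Two possible readings of the example config, side by side. Illustrative only.
train_batch_size = 1024   # data.train_batch_size
mini = 256                # actor.ppo_mini_batch_size
micro_per_gpu = 80        # actor.ppo_micro_batch_size_per_gpu
n = 5                     # rollout.n
num_gpus = 8              # hypothetical; not stated in the example

# Reading A: the batch sizes count prompts, and rollout.n is applied on top.
mini_batches_a = train_batch_size // mini            # 4 mini-batches per RL step

# Reading B: the batch sizes count generated samples (prompts * n).
mini_batches_b = train_batch_size * n // mini        # 20 mini-batches per RL step

# Under reading A, each mini-batch still expands to mini * n samples;
# sharded over num_gpus, the per-GPU size is divisible by 80:
mini_per_gpu = mini * n // num_gpus                  # 160 samples per GPU
grad_accum_steps = mini_per_gpu // micro_per_gpu     # 2 accumulation steps

print(mini_batches_a, mini_batches_b, grad_accum_steps)
```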
🤗 I would appreciate it a lot if anyone could help me clarify these questions. 🤗 🌻
same question