Which hyperparameters have the greatest impact on GPU memory usage?
Do ppo_mini_batch_size and ppo_micro_batch_size have a large impact on GPU memory usage? More generally, which parameters most affect GPU memory during PPO training? I tried adjusting max_response_length, train_batch_size, and rollout.n, but none of them seemed to noticeably reduce memory usage; in fact, the gradient computation on certain samples was still exhausting GPU memory.
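For reference, here is a minimal sketch of the knobs I have been experimenting with (the config paths follow the hydra-style layout in my copy of verl and may differ across versions; the values are illustrative, not a recommendation):

```yaml
# Illustrative values only; paths assume verl's hydra-style config layout
data:
  train_batch_size: 512        # prompts sampled per training step
  max_response_length: 1024    # cap on generated tokens per response
actor_rollout_ref:
  rollout:
    n: 4                       # responses sampled per prompt
  actor:
    ppo_mini_batch_size: 128   # samples per PPO optimizer update
    ppo_micro_batch_size: 4    # samples per forward/backward pass; smaller
                               # values trade speed for lower peak memory
                               # via gradient accumulation
```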