
Getting an Out of Memory (OOM) error when training a 7B model on 4x A100 80G GPUs using GRPO.

Open xxy33 opened this issue 5 months ago • 7 comments

I'm getting an Out of Memory (OOM) error when training a 7B model on 4x A100 80G GPUs with GRPO. The error message and parameters are shown below. What could be the cause?

[Screenshots: OOM traceback and training parameters]

xxy33 avatar Nov 14 '25 01:11 xxy33

----------Python Info----------
Version   : 3.10.12
Compiler  : GCC 11.4.0
Build     : ('main', 'Jul 29 2024 16:56:48')
Arch      : ('64bit', 'ELF')
------------Pip Info-----------
Version   : 25.1.1
Directory : /usr/local/lib/python3.10/dist-packages/pip
vllm      : 0.10.0
sglang    : not found
ray       : 2.47.1
torch     : 2.7.1
----------verl Info-----------
Version   : 0.7.0.dev
Directory : /workspace/verl/verl
Error running git command: fatal: detected dubious ownership in repository at '/workspace/verl'. To add an exception for this directory, call:

    git config --global --add safe.directory /workspace/verl

Commit Hash : None
----------Platform Info----------
Platform : Linux-6.8.0-87-generic-x86_64-with-glibc2.35
system   : Linux
node     : langchao-NF5468M6
release  : 6.8.0-87-generic
version  : #88~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Oct 14 14:03:14 UTC 2
----------Environment----------
CUDA Runtime  : 12.6
CUDA Compiler : Cuda compilation tools, release 12.6, V12.6.20
----------System Info----------
CPU Memory : 62.52 GB
GPU Count  : 7
GPU 1 : NVIDIA RTX A6000, 47.99 GB
GPU 2 : NVIDIA RTX A6000, 47.99 GB
GPU 3 : NVIDIA A100 80GB PCIe, 80.00 GB
GPU 4 : NVIDIA A100 80GB PCIe, 80.00 GB
GPU 5 : NVIDIA A100 80GB PCIe, 80.00 GB
GPU 6 : NVIDIA A100 80GB PCIe, 80.00 GB
GPU 7 : NVIDIA A100 80GB PCIe, 80.00 GB

xxy33 avatar Nov 14 '25 16:11 xxy33

This OOM error with a 7B model on 4x A100 80GB is likely related to memory fragmentation and the activation checkpointing configuration.

Quick diagnosis:

  1. Reduce the batch size: try data.train_batch_size=128 or 64
  2. Enable gradient checkpointing: actor_rollout_ref.model.enable_gradient_checkpointing=True
  3. Reduce tensor parallelism: lower tensor_model_parallel_size from 4 to 2
  4. Watch sequence length: 256 prompt + 256 response = 512 tokens per sample, times a batch of 256, adds up quickly

Memory breakdown:

  • 7B FP16 model: ~14GB
  • Optimizer (AdamW): ~28GB
  • Activations: 40-60GB (depends on sequence length)
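
For what it's worth, the weight and optimizer figures above are just back-of-envelope arithmetic (fp16 = 2 bytes per parameter, AdamW state approximated here as 2x the fp16 weights, 1 GB = 1e9 bytes); a quick sanity check:

```shell
# Rough check of the numbers quoted above; estimates, not measurements.
PARAMS=7000000000                          # 7B parameters
WEIGHTS_GB=$(( PARAMS * 2 / 1000000000 ))  # fp16 weights: 2 bytes/param -> 14
OPTIMIZER_GB=$(( WEIGHTS_GB * 2 ))         # AdamW state approximated as ~2x weights -> 28
echo "weights=${WEIGHTS_GB}GB optimizer=${OPTIMIZER_GB}GB"
```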

Suggested config:

data.train_batch_size=128
actor_rollout_ref.model.enable_gradient_checkpointing=True
actor_rollout_ref.model.use_remove_padding=True

This should bring peak memory < 70GB per GPU.
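
Put together, a launch command with those overrides might look like the sketch below; the `main_ppo` entry point and `adv_estimator=grpo` follow the usual verl conventions, but treat the exact key names as assumptions to check against your verl version:

```shell
# Sketch only -- merge these overrides into your existing launch script.
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=128 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2
```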

shanto12 avatar Nov 15 '25 14:11 shanto12

Are you suggesting I should modify enable_gradient_checkpointing? I've already tried lowering the batch size significantly, but it didn't help. I get the exact same error even when training a much smaller 0.6B model, and the issue is a system RAM OOM, not a GPU VRAM OOM. Could this be a memory leak?

Also, a second question: the disk holding /tmp/ray is over 95% full and is difficult to clean up, so I'd like to know how to change Ray's temporary storage path.


xxy33 avatar Nov 16 '25 03:11 xxy33


You can try setting the following environment variables:

    export HF_DATASETS_CACHE="your_path/.cache/hf/hf_datasets"
    export RAY_TMPDIR="your_path/.cache/ray_tmp"
    export TMPDIR="your_path/.cache/system_tmp"
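
For completeness, Ray's temporary directory can also be moved without environment variables; both of the following are standard Ray options (the path below is a placeholder):

```shell
# Option 1: pass the directory when starting the cluster manually
ray start --head --temp-dir=/data/ray_tmp

# Option 2: from Python, before any Ray work starts:
#   import ray; ray.init(_temp_dir="/data/ray_tmp")
```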

Gjl99 avatar Nov 16 '25 05:11 Gjl99

I am getting the same issue (on a different dataset), and it is CPU-side. At some point there is a memory spike; training can run anywhere from 0.5 to 5 epochs before it happens. I suspect it is related to parallelization: running the same code on 4 instead of 8 GPUs, with fewer dataloader workers, seems to get further before failing.

@Gjl99 have the latest env variables fixed it for you?

gm-kns avatar Nov 17 '25 16:11 gm-kns

You can try disabling the KL-divergence loss; in my testing it greatly reduces System Memory Utilization: actor_rollout_ref.actor.use_kl_loss=False

scut-zx avatar Nov 18 '25 08:11 scut-zx


Yes, I encountered the same issue as the original poster, and the environment variables above resolved it for me.

Gjl99 avatar Nov 18 '25 12:11 Gjl99