I'm getting an Out of Memory (OOM) error when training a 7B model on 4x A100 80GB GPUs using GRPO. The error message and my environment details are below. What could be the cause?
----------Python Info----------
Version   : 3.10.12
Compiler  : GCC 11.4.0
Build     : ('main', 'Jul 29 2024 16:56:48')
Arch      : ('64bit', 'ELF')
------------Pip Info-----------
Version   : 25.1.1
Directory : /usr/local/lib/python3.10/dist-packages/pip
vllm      : 0.10.0
sglang    : not found.
ray       : 2.47.1
torch     : 2.7.1
----------verl Info-----------
Version   : 0.7.0.dev
Directory : /workspace/verl/verl
Error running git command: fatal: detected dubious ownership in repository at '/workspace/verl'
To add an exception for this directory, call:
git config --global --add safe.directory /workspace/verl
Commit Hash : None
----------Platform Info----------
Platform : Linux-6.8.0-87-generic-x86_64-with-glibc2.35
system   : Linux
node     : langchao-NF5468M6
release  : 6.8.0-87-generic
version  : #88~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Oct 14 14:03:14 UTC 2
----------Environment----------
CUDA Runtime  : 12.6
CUDA Compiler : Cuda compilation tools, release 12.6, V12.6.20
----------System Info----------
CPU Memory : 62.52 GB
GPU Count  : 7
GPU 1 Type : NVIDIA RTX A6000
GPU 1 Memory : 47.99 GB
GPU 2 Type : NVIDIA RTX A6000
GPU 2 Memory : 47.99 GB
GPU 3 Type : NVIDIA A100 80GB PCIe
GPU 3 Memory : 80.00 GB
GPU 4 Type : NVIDIA A100 80GB PCIe
GPU 4 Memory : 80.00 GB
GPU 5 Type : NVIDIA A100 80GB PCIe
GPU 5 Memory : 80.00 GB
GPU 6 Type : NVIDIA A100 80GB PCIe
GPU 6 Memory : 80.00 GB
GPU 7 Type : NVIDIA A100 80GB PCIe
GPU 7 Memory : 80.00 GB
This OOM error with a 7B model on 4x A100 80GB is likely related to memory fragmentation and the activation checkpointing configuration.
Quick diagnosis:
- Reduce the batch size: try data.train_batch_size=128 or 64
- Enable gradient checkpointing: actor_rollout_ref.model.enable_gradient_checkpointing=True
- Optimize tensor parallelism: reduce tensor_model_parallel_size from 4 to 2
- Sequence length: 256 prompt + 256 response = 512 tokens per sample x 256 batch, roughly 131k tokens of activations per step, which is a lot of memory
Memory breakdown:
- 7B FP16 model: ~14GB
- Optimizer (AdamW): ~28GB
- Activations: 40-60GB (depends on sequence length)
Suggested config:
data.train_batch_size=128
actor_rollout_ref.model.enable_gradient_checkpointing=True
actor_rollout_ref.model.use_remove_padding=True
This should bring peak memory < 70GB per GPU.
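For reference, here is a minimal sketch of how these overrides could be passed on the command line. It assumes verl's standard main_ppo entry point with the GRPO advantage estimator; the model and data paths are placeholders to be replaced with your own.

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/path/to/train.parquet \
    data.val_files=/path/to/val.parquet \
    data.train_batch_size=128 \
    data.max_prompt_length=256 \
    data.max_response_length=256 \
    actor_rollout_ref.model.path=/path/to/7b-model \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1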
Are you suggesting I should modify enable_gradient_checkpointing? I've already tried lowering the batch size significantly, but it didn't help. I get the exact same error even when training a much smaller 0.6B model, and the issue is a system RAM OOM, not a GPU VRAM OOM. Could this be a memory leak? I also have another question: the disk holding /tmp/ray is over 95% full and is difficult to clean up, so I'd like to know how to change Ray's temporary storage path.
You can try setting the following paths:
export HF_DATASETS_CACHE="your_path/.cache/hf/hf_datasets"
export RAY_TMPDIR="your_path/.cache/ray_tmp"
export TMPDIR="your_path/.cache/system_tmp"
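If you start Ray yourself before launching training, you can also point its session directory somewhere else explicitly. A minimal sketch, where the path is only a placeholder and should be an absolute path on a disk with free space:

# placeholder path; use any absolute path on a disk with free space
export RAY_TMPDIR="/data/.cache/ray_tmp"
mkdir -p "$RAY_TMPDIR"
# restart Ray with an explicit temp dir (--temp-dir must be an absolute path)
ray stop
ray start --head --temp-dir="$RAY_TMPDIR"

If verl launches Ray for you instead, exporting the variables above before running the training script is the simpler route.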
I am getting the same issue (on a different dataset), and it is CPU-related. At some point there is a memory spike, and training can run anywhere from 0.5 to 5 epochs before it happens. I suspect it is related to parallelization: when running the same code on 4 instead of 8 GPUs, with fewer dataloader workers, it seems to get further before failing. A sketch for logging host RAM over time is below.
@Gjl99 have the latest env variables fixed it for you?
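To narrow down when the spike happens, one simple option is to log host RAM in the background while training runs and correlate the log with the step at which the job dies. A rough sketch using standard Linux tools (the log file name is arbitrary):

# append used/total system RAM with a timestamp every 10 seconds
while true; do
    echo "$(date '+%F %T') $(free -m | awk '/Mem:/ {print $3 " MiB used / " $2 " MiB total"}')" >> ram_usage.log
    sleep 10
done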
You can try setting the KL divergence loss to False; in my testing this significantly reduces System Memory Utilization:
actor_rollout_ref.actor.use_kl_loss=False
Yes, I encountered the same issue as the original poster, and the environment variables above resolved it for me.