[BUG] V100 32G, opt-1.5b, OOM
Training step 3 runs into an OOM error. Could you please help me fix it? Thanks. I already tried enabling the parameter, but it doesn't work.
OutOfMemoryError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 31.75 GiB total capacity; 28.05 GiB already allocated; 137.69 MiB free; 28.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-19 23:44:03,404] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 8432
[2023-04-19 23:44:03,405] [ERROR] [launch.py:434:sigkill_handler] ['/data/anaconda3/bin/python', '-u', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py', '--local_rank=0', '--actor_model_name_or_path', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/output_sft', '--critic_model_name_or_path', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/output_rw', '--num_padding_at_beginning', '1', '--gradient_accumulation_steps', '2', '--num_train_epochs', '1', '--ppo_epochs', '1', '--per_device_train_batch_size', '1', '--per_device_mini_train_batch_size', '1', '--actor_gradient_checkpointing', '--critic_gradient_checkpointing', '--deepspeed', '--output_dir', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/output_rlhf'] exits with return code = 1
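The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF when reserved memory is much larger than allocated memory. In this log reserved (28.08 GiB) is barely above allocated (28.05 GiB), so fragmentation is probably not the main cause, but the setting is cheap to try. A minimal sketch, assuming you re-run the same single-GPU launch from the same shell (128 MiB is an illustrative value, not a recommendation):

# Cap the allocator's split block size, as the error message suggests (value in MiB).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# Then re-launch the step 3 command shown above from this shell so the
# environment variable is inherited by the training process.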
QUOTE: (GPU 0; 31.75 GiB total capacity; 28.05 GiB already allocated; 137.69 MiB free; 28.08 GiB reserved in total by PyTorch)
I think this is a bit strange: the GPU has 31.75 GiB of total capacity, but PyTorch has only reserved 28.08 GiB, so about 3.67 GiB appears to be unused.
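To check whether that 3.67 GiB is really free or is held outside PyTorch's caching allocator (CUDA context, library workspaces, other processes), you can compare the driver's view with the numbers in the error message. A minimal check, assuming nvidia-smi is on the PATH:

# Driver-level view of the card: total, used, and free memory.
nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv
# Per-process usage; the training process typically accounts for most of memory.used,
# and the remainder is overhead not tracked by PyTorch's reserved/allocated counters.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv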
@molly-smith, it seems you started working on this issue; any update?
@Modas-Li, are you using one GPU or multiple GPUs? If you're using one GPU, please try increasing the number of GPUs to see if it helps (a launcher sketch follows below).
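For reference, a multi-GPU run only changes the launcher invocation; a sketch assuming two V100s on one node and the same arguments as in the log above (the DeepSpeed launcher injects --local_rank itself, so it is dropped here; how much memory this actually saves depends on how model and optimizer states are partitioned across the GPUs):

# Launch step 3 on 2 GPUs of the local node instead of 1.
deepspeed --num_gpus 2 \
    /data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
    --actor_model_name_or_path /data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/output_sft \
    --critic_model_name_or_path /data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/output_rw \
    --num_padding_at_beginning 1 \
    --gradient_accumulation_steps 2 \
    --num_train_epochs 1 \
    --ppo_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_mini_train_batch_size 1 \
    --actor_gradient_checkpointing \
    --critic_gradient_checkpointing \
    --deepspeed \
    --output_dir /data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/output_rlhf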
This issue is closed and will be reopened if necessary.