[BUG] V100 32G, opt-1.5b, OOM
Training step 3 runs into an OOM error. Could you please help me fix it? Thanks. I already tried enabling the parameter, but it doesn't work.
OutOfMemoryError: CUDA out of memory. Tried to allocate 198.00 MiB (GPU 0; 31.75 GiB total capacity; 28.05 GiB already allocated; 137.69 MiB free; 28.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-19 23:44:03,404] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 8432
[2023-04-19 23:44:03,405] [ERROR] [launch.py:434:sigkill_handler] ['/data/anaconda3/bin/python', '-u', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py', '--local_rank=0', '--actor_model_name_or_path', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/output_sft', '--critic_model_name_or_path', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/output_rw', '--num_padding_at_beginning', '1', '--gradient_accumulation_steps', '2', '--num_train_epochs', '1', '--ppo_epochs', '1', '--per_device_train_batch_size', '1', '--per_device_mini_train_batch_size', '1', '--actor_gradient_checkpointing', '--critic_gradient_checkpointing', '--deepspeed', '--output_dir', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/output_rlhf'] exits with return code = 1
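The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF when reserved memory is much larger than allocated memory. In this log reserved (28.08 GiB) is barely above allocated (28.05 GiB), so fragmentation is probably not the main cause, but the setting is cheap to try. A minimal sketch, assuming you re-run the same single-GPU launch from the same shell (128 MiB is an illustrative value, not a recommendation):

# Cap the allocator's split block size, as the error message suggests (value in MiB).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# Then re-launch the step 3 command shown above from this shell so the
# environment variable is inherited by the training process.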
QUOTE: (GPU 0; 31.75 GiB total capacity; 28.05 GiB already allocated; 137.69 MiB free; 28.08 GiB reserved in total by PyTorch)
I think this is a bit strange: the GPU has 31.75 GiB of total capacity, but PyTorch has only reserved 28.08 GiB, so about 3.67 GiB appears to be unused.
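To check whether that 3.67 GiB is really free or is held outside PyTorch's caching allocator (CUDA context, library workspaces, other processes), you can compare the driver's view with the numbers in the error message. A minimal check, assuming nvidia-smi is on the PATH:

# Driver-level view of the card: total, used, and free memory.
nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv
# Per-process usage; the training process typically accounts for most of memory.used,
# and the remainder is overhead not tracked by PyTorch's reserved/allocated counters.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv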
@molly-smith, it seems you started working on this issue; any update?
@Modas-Li, are you using one GPU or multiple GPUs? If you're using one GPU, please try increasing the number of GPUs to see if it helps (a launcher sketch follows below).
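For reference, a multi-GPU run only changes the launcher invocation; a sketch assuming two V100s on one node and the same arguments as in the log above (the DeepSpeed launcher injects --local_rank itself, so it is dropped here; how much memory this actually saves depends on how model and optimizer states are partitioned across the GPUs):

# Launch step 3 on 2 GPUs of the local node instead of 1.
deepspeed --num_gpus 2 \
    /data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
    --actor_model_name_or_path /data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/output_sft \
    --critic_model_name_or_path /data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/output_rw \
    --num_padding_at_beginning 1 \
    --gradient_accumulation_steps 2 \
    --num_train_epochs 1 \
    --ppo_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_mini_train_batch_size 1 \
    --actor_gradient_checkpointing \
    --critic_gradient_checkpointing \
    --deepspeed \
    --output_dir /data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/output_rlhf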
This issue is closed and will be reopened if necessary.