DeepSpeed
[BUG] RuntimeError: CUDA error: out of memory
Describe the bug
When I try to train my model, it crashes with this report:
(test_sam) forestbat@vm-jupyterhub-server:~/BELLE/train$ bash training_scripts/single_node/run_FT.sh
[2023-04-20 15:22:11,280] [INFO] [runner.py:540:main] cmd = /home/forestbat/.conda/envs/test_sam/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --sft_only_data_path ./BELLE/train_2M_CN.json --model_name_or_path ./dalai/llama/models/7B/ggml-model-q4_0.bin --data_split 2,4,4 --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --max_seq_len 512 --learning_rate 5e-6 --weight_decay 0.0001 --num_train_epochs 2 --gradient_accumulation_steps 8 --lr_scheduler_type cosine --num_warmup_steps 100 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --output_dir ./output
[2023-04-20 15:22:13,293] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-20 15:22:13,294] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-20 15:22:13,294] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-04-20 15:22:13,294] [INFO] [launch.py:247:main] dist_world_size=2
[2023-04-20 15:22:13,294] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-04-20 15:22:16,054] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/home/forestbat/BELLE/train/main.py", line 364, in <module>
main()
File "/home/forestbat/BELLE/train/main.py", line 202, in main
torch.distributed.barrier()
File "/home/forestbat/.conda/envs/test_sam/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3145, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[2023-04-20 15:22:18,300] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2706470
[2023-04-20 15:22:18,301] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2706471
[2023-04-20 15:22:18,414] [ERROR] [launch.py:434:sigkill_handler] ['/home/forestbat/.conda/envs/test_sam/bin/python3.10', '-u', 'main.py', '--local_rank=1', '--sft_only_data_path', './BELLE/train_2M_CN.json', '--model_name_or_path', './dalai/llama/models/7B/ggml-model-q4_0.bin', '--data_split', '2,4,4', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '2', '--max_seq_len', '512', '--learning_rate', '5e-6', '--weight_decay', '0.0001', '--num_train_epochs', '2', '--gradient_accumulation_steps', '8', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '100', '--seed', '1234', '--gradient_checkpointing', '--zero_stage', '3', '--deepspeed', '--output_dir', './output'] exits with return code = 1
I think my run_FT.sh is conservative enough, but it still crashed. Why?
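For reference, the script effectively reduces to the launcher command printed by the runner above, roughly along these lines (reconstructed from that cmd line; the actual run_FT.sh may differ in paths or extra flags):

deepspeed main.py \
   --sft_only_data_path ./BELLE/train_2M_CN.json \
   --model_name_or_path ./dalai/llama/models/7B/ggml-model-q4_0.bin \
   --data_split 2,4,4 \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 2 \
   --max_seq_len 512 \
   --learning_rate 5e-6 \
   --weight_decay 0.0001 \
   --num_train_epochs 2 \
   --gradient_accumulation_steps 8 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 100 \
   --seed 1234 \
   --gradient_checkpointing \
   --zero_stage 3 \
   --deepspeed \
   --output_dir ./output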
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/forestbat/.conda/envs/test_sam/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/home/forestbat/.conda/envs/test_sam/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
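(As a side note, the async_io warnings in the op report above are unrelated to the OOM; on Ubuntu they can usually be cleared with the package the report itself suggests, for example:

sudo apt-get install libaio-dev

after which the async_io op can be JIT-compiled if needed.)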
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: Nvidia A40*2
- Python version: 3.9
Hi @forestbat, can you provide your run_FT.sh script or another script that reproduces the issue?
Closing. Please reopen if the issue persists.