[BUG] RuntimeError: CUDA error: unknown error
Describe the bug
The training run fails with RuntimeError: CUDA error: unknown error during the initial perplexity evaluation.
Log output
***** Evaluating perplexity, Epoch 0/1 *****
Traceback (most recent call last):
File "main.py", line 345, in TORCH_USE_CUDA_DSA to enable device-side assertions.
[2023-04-28 09:46:00,441] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2469
[2023-04-28 09:46:00,442] [ERROR] [launch.py:434:sigkill_handler] ['/home/sh0an/anaconda3/envs/Chat/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
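The traceback above is truncated, and because CUDA errors are reported asynchronously, the line it blames is often not the real failure point. One way to localize it, assuming the same single-GPU launch as in the reproduce command below, is to rerun with synchronous kernel launches (CUDA_LAUNCH_BLOCKING is a standard PyTorch/CUDA debugging switch, not anything specific to DeepSpeed-Chat):

# Synchronous launches make the stack trace point at the op that actually failed.
CUDA_LAUNCH_BLOCKING=1 python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu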
To Reproduce
Execute script: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
PyTorch: 2.0, CUDA: 11.8
System info (please complete the following information):
- OS: Linux version 5.10.16.3-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Apr 2 22:23:49 UTC 2021
- GPU: One RTX 4070 Ti (12 GB)
- Python version: 3.8
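Before chasing the error further, it may be worth confirming that this PyTorch 2.0 / CUDA 11.8 build actually sees the GPU from inside WSL2. A quick check using only standard torch APIs (nothing DeepSpeed-Chat specific):

# Prints the torch build, its CUDA version, and the visible GPU (or 'no GPU').
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no GPU')"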
An RTX 4070 Ti (12 GB) does not have enough memory to train the ds-chat 1.3b model. I have hit this error before as well; in my case it was because my RTX 3090 was overheating and had stopped working at the time.
Hi @SH0AN, as @zy-sunshine mentioned, one 12 GB GPU is not enough for this task.
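For anyone who still wants to attempt step 1 on a single 12 GB card, the usual memory levers are a higher ZeRO stage with CPU offload, gradient checkpointing, and a smaller micro-batch. The sketch below is an assumption, not a verified recipe: only --zero_stage, --lora_dim and --gradient_accumulation_steps appear in the log above, so check the remaining flag names against python main.py --help in your DeepSpeed-Chat checkout.

# Assumed memory-saving variant of the step-1 launch; flag names other than
# --zero_stage / --lora_dim / --gradient_accumulation_steps are unverified here.
deepspeed main.py \
    --model_name_or_path facebook/opt-1.3b \
    --zero_stage 3 \
    --offload \
    --gradient_checkpointing \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lora_dim 128 \
    --deepspeed \
    --output_dir ./output/actor-models/1.3b

Even with all of these enabled, 12 GB may still be too little for the 1.3b actor, which is consistent with the advice above.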