[BUG] RuntimeError: CUDA error: unknown error
Describe the bug
The training run fails with RuntimeError: CUDA error: unknown error during the initial perplexity evaluation.
Log output
***** Evaluating perplexity, Epoch 0/1 *****
Traceback (most recent call last):
File "main.py", line 345, in TORCH_USE_CUDA_DSA to enable device-side assertions.
[2023-04-28 09:46:00,441] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2469
[2023-04-28 09:46:00,442] [ERROR] [launch.py:434:sigkill_handler] ['/home/sh0an/anaconda3/envs/Chat/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '8', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/home/sh0an/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
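The traceback above is truncated, and because CUDA errors are reported asynchronously, the line it blames is often not the real failure point. One way to localize it, assuming the same single-GPU launch as in the reproduce command below, is to rerun with synchronous kernel launches (CUDA_LAUNCH_BLOCKING is a standard PyTorch/CUDA debugging switch, not anything specific to DeepSpeed-Chat):

# Synchronous launches make the stack trace point at the op that actually failed.
CUDA_LAUNCH_BLOCKING=1 python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu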
To Reproduce
Execute script: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_gpu
PyTorch: 2.0, CUDA: 11.8
System info (please complete the following information):
- OS: Linux version 5.10.16.3-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Apr 2 22:23:49 UTC 2021
- GPU: One RTX 4070 Ti (12 GB)
- Python version: 3.8
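Before chasing the error further, it may be worth confirming that this PyTorch 2.0 / CUDA 11.8 build actually sees the GPU from inside WSL2. A quick check using only standard torch APIs (nothing DeepSpeed-Chat specific):

# Prints the torch build, its CUDA version, and the visible GPU (or 'no GPU').
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no GPU')"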
An RTX 4070 Ti (12 GB) does not have enough memory to train the ds-chat 1.3b model. I have hit this error before as well; in my case it was because my RTX 3090 was overheating and had stopped working at the time.
Hi @SH0AN, as @zy-sunshine mentioned, one 12 GB GPU is not enough for this task.
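For anyone who still wants to attempt step 1 on a single 12 GB card, the usual memory levers are a higher ZeRO stage with CPU offload, gradient checkpointing, and a smaller micro-batch. The sketch below is an assumption, not a verified recipe: only --zero_stage, --lora_dim and --gradient_accumulation_steps appear in the log above, so check the remaining flag names against python main.py --help in your DeepSpeed-Chat checkout.

# Assumed memory-saving variant of the step-1 launch; flag names other than
# --zero_stage / --lora_dim / --gradient_accumulation_steps are unverified here.
deepspeed main.py \
    --model_name_or_path facebook/opt-1.3b \
    --zero_stage 3 \
    --offload \
    --gradient_checkpointing \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --lora_dim 128 \
    --deepspeed \
    --output_dir ./output/actor-models/1.3b

Even with all of these enabled, 12 GB may still be too little for the 1.3b actor, which is consistent with the advice above.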