
Training error on a single GPU with torch 2.3.1+cu121 and the latest code, please help

Open lilyzlt opened this issue 1 year ago • 1 comments

error

When I use 2 GPUs to train a FLUX LoRA, everything is fine and training succeeds. But when I use one GPU, or start with 2 GPUs but use only one, I get the error below. The code is at the latest commit, ID 1767dd101e63a50b159fe4ce55754caaf0078cb8. I tried: export NCCL_DEBUG=INFO; export CUDA_DEVICE_ORDER="PCI_BUS_ID"; export NCCL_IB_DISABLE=1

image image

Environment :

lion-pytorch 0.1.2
open-clip-torch 2.20.0
pytorch-lightning 1.9.0
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchmetrics 1.4.1
torchvision 0.18.1+cu121
nvidia-nccl-cu12 2.20.5
System: CentOS 8

image image

please help!

lilyzlt, Sep 02 '24 09:09

If I use 2 GPUs, everything is fine: image image

lilyzlt, Sep 02 '24 09:09

Try setting CUDA_VISIBLE_DEVICES.
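A minimal sketch of that suggestion, assuming the goal is to expose only one GPU to the training process. CUDA_VISIBLE_DEVICES is a standard CUDA environment variable; the helper names `single_gpu_env` and `run_on_single_gpu` are hypothetical, and the child command here is only a stand-in that echoes the variable, not the real training launcher. Whether this resolves the error in the screenshots is not confirmed in the thread.

```python
import os
import subprocess
import sys


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA.

    Hypothetical helper for illustration; it does not modify the input mapping.
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA only enumerates the devices listed here, so the selected GPU
    # is seen by the child process as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_single_gpu(command, gpu_index=0):
    """Run `command` (a list of arguments) with only `gpu_index` visible."""
    return subprocess.run(
        command,
        env=single_gpu_env(gpu_index),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in child process: it just prints the variable it received.
    # Replace `probe` with your actual training command.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    print(run_on_single_gpu(probe, gpu_index=0).stdout.strip())
```

The variable has to be set before the process initializes CUDA, which is why the sketch passes it through the child's environment rather than changing it after startup. In a shell, the equivalent is to export CUDA_VISIBLE_DEVICES=0 before launching training.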

Akegarasu, Nov 05 '24 02:11