torch 2.3.1 + CUDA 12.1 with the latest code: training error when using a single GPU, please help
error
When I use 2 GPUs to train a FLUX LoRA, everything is fine and training succeeds. But when I use one GPU, or start with 2 GPUs but use only one, I get the error below. The code is at the latest commit, ID 1767dd101e63a50b159fe4ce55754caaf0078cb8. I tried setting:
export NCCL_DEBUG=INFO
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export NCCL_IB_DISABLE=1
Environment:
lion-pytorch 0.1.2
open-clip-torch 2.20.0
pytorch-lightning 1.9.0
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchmetrics 1.4.1
torchvision 0.18.1+cu121
nvidia-nccl-cu12 2.20.5
OS: CentOS 8
Please help!
If I use 2 GPUs, everything is fine:
Try setting CUDA_VISIBLE_DEVICES so that only the GPU you want to train on is visible to the process.
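A minimal sketch of that suggestion in Python, in case it helps. The helper name `restrict_to_gpu` is hypothetical and not part of the training code. It assumes the variable is set before anything initializes CUDA, and it reuses the `CUDA_DEVICE_ORDER` value from the original post so the index follows PCI bus order:

```python
import os


def restrict_to_gpu(index: int = 0) -> str:
    """Hypothetical helper: expose only one physical GPU to this process.

    Must run before CUDA is initialized, otherwise the setting is ignored.
    """
    # Same device ordering the original post already exports.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Only the listed GPU is visible; inside the process it becomes cuda:0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


restrict_to_gpu(0)

# Optional sanity check, skipped when torch is not installed.
try:
    import torch
except ImportError:
    torch = None

if torch is not None:
    # Reports how many GPUs this process can see after masking.
    print("visible GPUs:", torch.cuda.device_count())
```

The same effect is usually achieved from the shell by exporting `CUDA_VISIBLE_DEVICES=0` before the launch command. If the launcher is still configured for 2 processes while only one GPU is visible, the process count probably needs to be reduced to 1 as well; that part is a guess, since the error log is not shown here.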