LoRA error
Hi, I have been hitting this error frequently, in roughly 4 out of 10 training runs. Has anyone faced this, or does anyone know how to get around it?
CC: @patrickvonplaten @patil-suraj @cloneofsimo
02/19/2023 10:40:23 - INFO - __main__ - ***** Running training *****
02/19/2023 10:40:23 - INFO - __main__ - Num examples = 4
02/19/2023 10:40:23 - INFO - __main__ - Num batches each epoch = 4
02/19/2023 10:40:23 - INFO - __main__ - Num Epochs = 500
02/19/2023 10:40:23 - INFO - __main__ - Instantaneous batch size per device = 1
02/19/2023 10:40:23 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
02/19/2023 10:40:23 - INFO - __main__ - Gradient Accumulation steps = 1
02/19/2023 10:40:23 - INFO - __main__ - Total optimization steps = 2000
Steps: 0%| | 0/2000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/cr/j/eb8f7b574a904485bbf1929b0dc6b1de/exe/wd/train_dreambooth_lora.py", line 1016, in
    cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
Hi @polavishnu4444! Can you please run `diffusers-cli env` and send us the output? Please also let us know which GPU(s) you are running the process on. Thanks a lot!
@pcuenca - we are using an A100 80GB machine, which is where this error is seen.
- huggingface_hub version: 0.12.1
- Platform: Linux-5.4.0-1063-azure-x86_64-with-glibc2.27
- Python version: 3.9.12
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: ~/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 1.13.0+cu116
- Jinja2: 3.1.2
- Graphviz: N/A
- Pydot: N/A
- Pillow: 9.4.0
- hf_transfer: N/A
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: ~/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: ~/.cache/huggingface/assets
- HF_HUB_OFFLINE: False
- HF_TOKEN_PATH: ~/.cache/huggingface/token
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
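One way to narrow down whether this is an environment problem or a script problem would be a bare FP32 matmul on the same GPU, which exercises the same family of cuBLAS GEMM calls as the failing `cublasSgemm` in the traceback. A hypothetical check, not something run as part of this report:

```python
# Hypothetical sanity check, not part of the original report: print the versions
# that matter for cuBLAS and run a bare FP32 matmul on the GPU.
import torch

print(torch.__version__)              # 1.13.0+cu116 in the environment above
print(torch.version.cuda)             # CUDA runtime the wheel was built against
print(torch.cuda.get_device_name(0))  # expected: an A100 80GB

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b                             # FP32 matmul -> cuBLAS GEMM
torch.cuda.synchronize()              # surface any asynchronous CUDA error here
print(c.shape)                        # torch.Size([1024, 1024]) if the GPU is healthy
```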
@pcuenca - Curious whether you were able to make any sense of the error and the issue?
Gentle ping @pcuenca
Hi @polavishnu4444, sorry for the delay :( I don't really know why this would happen sporadically, and I haven't been able to reproduce. I assume you are using the unmodified version of the train_dreambooth_lora script, is that correct? Are you still seeing this with PyTorch 1.13.1 (instead of 1.13.0) and the latest version of diffusers?
Maybe @patil-suraj or @sayakpaul can provide some insight?
This very much seems like an environment setup problem to me, and I couldn't reproduce it on my end either (PyTorch 1.13, CUDA 11.6).
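If the error resurfaces, one commonly used debugging step (an assumption for this case, not something confirmed in the thread) is to rerun with synchronous CUDA launches so the traceback points at the kernel that actually failed:

```python
# Sketch of a common debugging step (an assumption, not something verified in
# this thread): CUDA errors are asynchronous, so a failed cuBLAS call can be
# reported at a later, unrelated line. Forcing synchronous launches makes the
# traceback point at the operation that actually failed. The variable must be
# set before the first CUDA call, or exported in the shell before launching
# train_dreambooth_lora.py.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
assert torch.cuda.is_available()
x = torch.randn(8, 8, device="cuda")
y = x @ x          # with blocking launches, a failure here raises here
print(y.sum().item())
```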
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.