LoRA error

polavishnu4444 opened this issue 2 years ago

Hi, I have been hitting this error frequently, in roughly 4 out of 10 training runs. Has anyone faced this, or does anyone know how to work around it?

CC: @patrickvonplaten @patil-suraj @cloneofsimo

02/19/2023 10:40:23 - INFO - main - ***** Running training *****
02/19/2023 10:40:23 - INFO - main - Num examples = 4
02/19/2023 10:40:23 - INFO - main - Num batches each epoch = 4
02/19/2023 10:40:23 - INFO - main - Num Epochs = 500
02/19/2023 10:40:23 - INFO - main - Instantaneous batch size per device = 1
02/19/2023 10:40:23 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 1
02/19/2023 10:40:23 - INFO - main - Gradient Accumulation steps = 1
02/19/2023 10:40:23 - INFO - main - Total optimization steps = 2000

0%| | 0/2000 [00:00<?, ?it/s]
Steps: 0%| | 0/2000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/cr/j/eb8f7b574a904485bbf1929b0dc6b1de/exe/wd/train_dreambooth_lora.py", line 1016, in <module>
    main(args)
  File "/cr/j/eb8f7b574a904485bbf1929b0dc6b1de/exe/wd/train_dreambooth_lora.py", line 873, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/diffusers/models/unet_2d_condition.py", line 481, in forward
    sample, res_samples = downsample_block(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/diffusers/models/unet_2d_blocks.py", line 789, in forward
    hidden_states = attn(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/diffusers/models/transformer_2d.py", line 257, in forward
    hidden_states = self.proj_in(hidden_states)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

polavishnu4444 avatar Feb 21 '23 09:02 polavishnu4444
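A note on reading this traceback: CUDA kernels launch asynchronously, so the Python frame that reports CUBLAS_STATUS_EXECUTION_FAILED is not guaranteed to be the operation that actually failed. A minimal sketch of rerunning the script with synchronous launches (assuming the same train_dreambooth_lora.py entry point; the usual training arguments are omitted):

import os
import subprocess

# CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, so the reported
# stack trace points at the op that really failed rather than a later cuBLAS call.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
subprocess.run(
    ["python", "train_dreambooth_lora.py"],  # append the usual training flags here
    env=env,
    check=True,
)

The run is slower with blocking launches, but the resulting traceback is much easier to interpret.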

Hi @polavishnu4444! Can you please run diffusers-cli env and send us the output? Please also let us know which GPU(s) you are running the process on. Thanks a lot!

pcuenca avatar Feb 21 '23 10:02 pcuenca

@pcuenca - We are using an A100 80GB machine, which is where this error is seen. Here is the environment output:

  • huggingface_hub version: 0.12.1
  • Platform: Linux-5.4.0-1063-azure-x86_64-with-glibc2.27
  • Python version: 3.9.12
  • Running in iPython ?: No
  • Running in notebook ?: No
  • Running in Google Colab ?: No
  • Token path ?: ~/.cache/huggingface/token
  • Has saved token ?: False
  • Configured git credential helpers:
  • FastAI: N/A
  • Tensorflow: N/A
  • Torch: 1.13.0+cu116
  • Jinja2: 3.1.2
  • Graphviz: N/A
  • Pydot: N/A
  • Pillow: 9.4.0
  • hf_transfer: N/A
  • ENDPOINT: https://huggingface.co
  • HUGGINGFACE_HUB_CACHE: ~/.cache/huggingface/hub
  • HUGGINGFACE_ASSETS_CACHE: ~/.cache/huggingface/assets
  • HF_HUB_OFFLINE: False
  • HF_TOKEN_PATH: ~/.cache/huggingface/token
  • HF_HUB_DISABLE_PROGRESS_BARS: None
  • HF_HUB_DISABLE_SYMLINKS_WARNING: False
  • HF_HUB_DISABLE_IMPLICIT_TOKEN: False
  • HF_HUB_ENABLE_HF_TRANSFER: False

polavishnu4444 avatar Feb 21 '23 22:02 polavishnu4444
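Since the failure comes from cublasSgemm, a plain float32 matmul exercises the same cuBLAS path outside of the training script. A minimal sanity check to run on the same A100 node (a hedged sketch, assuming CUDA is available; the matrix sizes and loop count are arbitrary):

import torch

device = torch.device("cuda")
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(device))

# A float32 matmul on the GPU dispatches to cuBLAS SGEMM, the call that fails above.
a = torch.randn(4096, 4096, device=device, dtype=torch.float32)
b = torch.randn(4096, 4096, device=device, dtype=torch.float32)
for _ in range(100):
    c = a @ b
torch.cuda.synchronize()
print("matmul ok:", tuple(c.shape))

If this loop also fails intermittently, the problem is more likely in the driver, the CUDA libraries, or the GPU itself rather than in the training code.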

@pcuenca - Curious whether you were able to make any sense of the error and the issue?

polavishnu4444 avatar Feb 23 '23 14:02 polavishnu4444

Gentle ping @pcuenca

patrickvonplaten avatar Mar 28 '23 12:03 patrickvonplaten

Hi @polavishnu4444, sorry for the delay :( I don't really know why this would happen sporadically, and I haven't been able to reproduce it. I assume you are using the unmodified version of the train_dreambooth_lora script; is that correct? Are you still seeing this with PyTorch 1.13.1 (instead of 1.13.0) and the latest version of diffusers?

Maybe @patil-suraj or @sayakpaul can provide some insight?

pcuenca avatar Mar 29 '23 09:03 pcuenca
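For reference, a quick way to confirm the versions pcuenca asks about, as a small hedged sketch (the expected version strings are illustrative, not required values):

import diffusers
import torch

# Checks whether the environment now has PyTorch 1.13.1 and a current diffusers release.
print("torch:", torch.__version__)          # e.g. 1.13.1+cu116 after upgrading
print("diffusers:", diffusers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))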

Very much seems like an environment setup problem to me, and I couldn't reproduce it on my end either (PyTorch 1.13, CUDA 11.6).

sayakpaul avatar Mar 29 '23 10:03 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 23 '23 15:04 github-actions[bot]