LoRA error
Hi, I have been hitting this error frequently, in roughly 4 out of 10 training runs. Has anyone faced this, or does anyone know how to get around it?
CC: @patrickvonplaten @patil-suraj @cloneofsimo
02/19/2023 10:40:23 - INFO - __main__ - ***** Running training *****
02/19/2023 10:40:23 - INFO - __main__ - Num examples = 4
02/19/2023 10:40:23 - INFO - __main__ - Num batches each epoch = 4
02/19/2023 10:40:23 - INFO - __main__ - Num Epochs = 500
02/19/2023 10:40:23 - INFO - __main__ - Instantaneous batch size per device = 1
02/19/2023 10:40:23 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
02/19/2023 10:40:23 - INFO - __main__ - Gradient Accumulation steps = 1
02/19/2023 10:40:23 - INFO - __main__ - Total optimization steps = 2000
Steps: 0%| | 0/2000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/cr/j/eb8f7b574a904485bbf1929b0dc6b1de/exe/wd/train_dreambooth_lora.py", line 1016, in
    cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
Hi @polavishnu4444! Can you please run `diffusers-cli env` and send us the output? Please also let us know which GPU(s) you are running the process on. Thanks a lot!
@pcuenca - we are using an A100 80GB machine, which is where this error is seen.
- huggingface_hub version: 0.12.1
- Platform: Linux-5.4.0-1063-azure-x86_64-with-glibc2.27
- Python version: 3.9.12
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: ~/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 1.13.0+cu116
- Jinja2: 3.1.2
- Graphviz: N/A
- Pydot: N/A
- Pillow: 9.4.0
- hf_transfer: N/A
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: ~/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: ~/.cache/huggingface/assets
- HF_HUB_OFFLINE: False
- HF_TOKEN_PATH: ~/.cache/huggingface/token
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
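One way to narrow down whether this is an environment problem or a script problem would be a bare FP32 matmul on the same GPU, which exercises the same family of cuBLAS GEMM calls as the failing `cublasSgemm` in the traceback. A hypothetical check, not something run as part of this report:

```python
# Hypothetical sanity check, not part of the original report: print the versions
# that matter for cuBLAS and run a bare FP32 matmul on the GPU.
import torch

print(torch.__version__)              # 1.13.0+cu116 in the environment above
print(torch.version.cuda)             # CUDA runtime the wheel was built against
print(torch.cuda.get_device_name(0))  # expected: an A100 80GB

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b                             # FP32 matmul -> cuBLAS GEMM
torch.cuda.synchronize()              # surface any asynchronous CUDA error here
print(c.shape)                        # torch.Size([1024, 1024]) if the GPU is healthy
```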
@pcuenca - Curious whether you were able to make any sense of the error and the issue?
Gentle ping @pcuenca
Hi @polavishnu4444, sorry for the delay :( I don't really know why this would happen sporadically, and I haven't been able to reproduce. I assume you are using the unmodified version of the train_dreambooth_lora script, is that correct? Are you still seeing this with PyTorch 1.13.1 (instead of 1.13.0) and the latest version of diffusers?
Maybe @patil-suraj or @sayakpaul can provide some insight?
This very much seems like an environment setup problem to me, and I couldn't reproduce it on my end either (PyTorch 1.13, CUDA 11.6).
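If the error resurfaces, one commonly used debugging step (an assumption for this case, not something confirmed in the thread) is to rerun with synchronous CUDA launches so the traceback points at the kernel that actually failed:

```python
# Sketch of a common debugging step (an assumption, not something verified in
# this thread): CUDA errors are asynchronous, so a failed cuBLAS call can be
# reported at a later, unrelated line. Forcing synchronous launches makes the
# traceback point at the operation that actually failed. The variable must be
# set before the first CUDA call, or exported in the shell before launching
# train_dreambooth_lora.py.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
assert torch.cuda.is_available()
x = torch.randn(8, 8, device="cuda")
y = x @ x          # with blocking launches, a failure here raises here
print(y.sum().item())
```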
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.