Strange behaviour and issues with SD 2.1 on multi-GPU (4090) and xformers attention
Describe the bug
Hello!
Today I tried to set up SD 2.1 but ran into more than one issue.
pipe = StableDiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1",
torch_dtype=torch.float16,
).to("cuda") # or cuda:0
together with
pipe.enable_xformers_memory_efficient_attention()
works insofar as it produces sensible images, but it is very slow, so the xformers attention is obviously NOT really working. Surprisingly, inference is even faster if I comment out that line:
pipe.enable_xformers_memory_efficient_attention()
100%| 20/20 [00:04<00:00, 4.92it/s]
100%| 20/20 [00:03<00:00, 5.40it/s]
100%| 20/20 [00:03<00:00, 5.40it/s]

#pipe.enable_xformers_memory_efficient_attention()
100%| 20/20 [00:03<00:00, 6.02it/s]
100%| 20/20 [00:02<00:00, 6.72it/s]
100%| 20/20 [00:02<00:00, 6.71it/s]
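For reference, a with/without-xformers comparison can be timed with something along these lines (minimal sketch only; the prompt and step count are placeholders, not exactly what I ran):

import time
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline in fp16 on the primary GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
).to("cuda:0")

def benchmark(label, n_runs=3):
    # Time a few short generations and report seconds per image.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe("a photo of a tiger", num_inference_steps=20)
    torch.cuda.synchronize()
    print(f"{label}: {(time.perf_counter() - start) / n_runs:.2f} s/image")

benchmark("default attention")
pipe.enable_xformers_memory_efficient_attention()
benchmark("xformers attention")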
If I change the CUDA device index from 0 to 1 or 2, the inference speed increases more than 4x, but the images all look something like this:

and the code stops at the 31st step with an error. Interestingly, it makes no difference whether I generate one image with 50 steps (the code stops at step 31) or four images with 10 steps each: in that case three images are created and the fourth generation stops at step 1, i.e. 31 steps in total.
100%| 10/10 [00:01<00:00, 5.62it/s]
100%| 10/10 [00:00<00:00, 27.70it/s]
100%| 10/10 [00:00<00:00, 27.82it/s]
 10%| 1/10 [00:00<00:00, 16.12it/s]
The traceback is:
Traceback (most recent call last):
  File "/home/marc/Schreibtisch/AI/SD2/test.py", line 58, in <module>
    image3 = pipe(prompt=prompt, num_inference_steps=10, negative_prompt=n_propmt, guidance_scale=7).images[0]
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 517, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 407, in forward
    sample = upsample_block(
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 1203, in forward
    hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states).sample
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/attention.py", line 216, in forward
    hidden_states = block(hidden_states, context=encoder_hidden_states, timestep=timestep)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/attention.py", line 484, in forward
    hidden_states = self.attn1(norm_hidden_states) + hidden_states
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/attention.py", line 594, in forward
    hidden_states = self.to_out[0](hidden_states)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`
I have the same problem with stable-diffusion-2-1-base. If I replace the model path with runwayml/stable-diffusion-v1-5, everything works fine.
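For what it's worth, an alternative way to test the non-primary cards is to restrict device visibility at the process level instead of using .to("cuda:1") / .to("cuda:2"); this is only a sketch of that approach, not something I have verified changes the behaviour here:

import os

# Expose only the second physical GPU to this process; inside PyTorch it
# then appears as cuda:0. This must be set before CUDA is initialised.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
).to("cuda")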
System information:
Ubuntu 22.04
Python 3.10.6
diffusers 0.10.2
Torch 1.13.0+cu117
xformers 0.0.15.dev0+c101579.d20221128

nvcc -V
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
nvidia-smi
Tue Dec 13 00:31:34 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Graphics... On | 00000000:23:00.0 On | Off |
| 0% 56C P8 39W / 450W | 996MiB / 24564MiB | 19% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA Graphics... On | 00000000:2D:00.0 Off | Off |
| 0% 39C P8 29W / 450W | 6MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA Graphics... On | 00000000:2E:00.0 Off | Off |
| 0% 34C P8 24W / 450W | 6MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Reproduction
import torch
import requests
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda:2")
pipe.enable_xformers_memory_efficient_attention()

prompt = "one tigers"
n_prompt = "artstation"
image = pipe(prompt=prompt, num_inference_steps=20, negative_prompt=n_prompt, guidance_scale=7).images[0]
image.save("a.jpg")

prompt = "two tigers"
n_prompt = "artstation"
image = pipe(prompt=prompt, num_inference_steps=20, negative_prompt=n_prompt, guidance_scale=7).images[0]
image.save("b.jpg")

prompt = "three tigers"
n_prompt = "artstation"
image = pipe(prompt=prompt, num_inference_steps=20, negative_prompt=n_prompt, guidance_scale=7).images[0]
image.save("c.jpg")
Logs
No response
System Info
Ubuntu 22.04
Python 3.10.6
diffusers 0.10.2
Torch 1.13.0+cu117
xformers 0.0.15.dev0+c101579.d20221128
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
CUDA Version: 11.8
I have the same problem, and I have only one GPU: the SD 2.0 768-v model produces wrong results with xformers, while SD 2.0-base (512) is normal.
Perhaps try instantiating a DDIM scheduler explicitly and passing it in to the pipeline.from_pretrained() call?
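Something like this, I think (untested sketch; the subfolder argument assumes the standard layout of the Hub repo):

import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

model_id = "stabilityai/stable-diffusion-2-1"

# Build the DDIM scheduler from the model's own scheduler config and
# pass it to the pipeline explicitly.
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to("cuda")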
@lawfordp2017 Same problem.
Hmm, that's interesting. @pcuenca could you maybe check whether it works for you on a 3090?
@linyu0219 what GPU did you use, a 4090 or something else? Also, it would be nice if you could share the code snippet.
Works fine for me on 3090 (cuda:0) and 2080 Ti (cuda:1). I don't currently have any 4090s to test.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.