Strange behaviour and issues with SD 2.1 on multi-GPU (4090) and xformers attention
Describe the bug
Hello!
Today I tried to set up SD 2.1 but ran into more than one issue.
pipe = StableDiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1",
torch_dtype=torch.float16,
).to("cuda") # or cuda:0
together with
pipe.enable_xformers_memory_efficient_attention()
works insofar as it produces sensible images, but it is very slow, so the xformers attention is obviously NOT really working. Surprisingly, inference is even faster if I comment out that line:
pipe.enable_xformers_memory_efficient_attention()
100%| 20/20 [00:04<00:00, 4.92it/s]
100%| 20/20 [00:03<00:00, 5.40it/s]
100%| 20/20 [00:03<00:00, 5.40it/s]

#pipe.enable_xformers_memory_efficient_attention()
100%| 20/20 [00:03<00:00, 6.02it/s]
100%| 20/20 [00:02<00:00, 6.72it/s]
100%| 20/20 [00:02<00:00, 6.71it/s]
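For reference, a with/without-xformers comparison can be timed with something along these lines (minimal sketch only; the prompt and step count are placeholders, not exactly what I ran):

import time
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline in fp16 on the primary GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
).to("cuda:0")

def benchmark(label, n_runs=3):
    # Time a few short generations and report seconds per image.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe("a photo of a tiger", num_inference_steps=20)
    torch.cuda.synchronize()
    print(f"{label}: {(time.perf_counter() - start) / n_runs:.2f} s/image")

benchmark("default attention")
pipe.enable_xformers_memory_efficient_attention()
benchmark("xformers attention")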
If I change the CUDA device index from 0 to 1 or 2, the inference speed increases more than 4x, but the images all look something like this:

and the code stops at the 31st step with an error. Interestingly, it makes no difference whether I generate one image with 50 steps (the code stops at step 31) or four images with 10 steps each: in that case three images are created and the fourth generation stops at step 1, i.e. 31 steps in total.
100%| 10/10 [00:01<00:00, 5.62it/s]
100%| 10/10 [00:00<00:00, 27.70it/s]
100%| 10/10 [00:00<00:00, 27.82it/s]
 10%| 1/10 [00:00<00:00, 16.12it/s]
The traceback is:
Traceback (most recent call last):
  File "/home/marc/Schreibtisch/AI/SD2/test.py", line 58, in <module>
    image3 = pipe(prompt=prompt, num_inference_steps=10, negative_prompt=n_propmt, guidance_scale=7).images[0]
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 517, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 407, in forward
    sample = upsample_block(
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 1203, in forward
    hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states).sample
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/attention.py", line 216, in forward
    hidden_states = block(hidden_states, context=encoder_hidden_states, timestep=timestep)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/attention.py", line 484, in forward
    hidden_states = self.attn1(norm_hidden_states) + hidden_states
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/diffusers/models/attention.py", line 594, in forward
    hidden_states = self.to_out[0](hidden_states)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/anaconda3/envs/temp/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`
I have the same problem with stable-diffusion-2-1-base. If I replace the model path with runwayml/stable-diffusion-v1-5, everything works fine.
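For what it's worth, an alternative way to test the non-primary cards is to restrict device visibility at the process level instead of using .to("cuda:1") / .to("cuda:2"); this is only a sketch of that approach, not something I have verified changes the behaviour here:

import os

# Expose only the second physical GPU to this process; inside PyTorch it
# then appears as cuda:0. This must be set before CUDA is initialised.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
).to("cuda")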
System information:
Ubuntu 22.04
Python 3.10.6
diffusers 0.10.2
Torch 1.13.0+cu117
xformers 0.0.15.dev0+c101579.d20221128

nvcc -V
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
nvidia-smi
Tue Dec 13 00:31:34 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Graphics... On | 00000000:23:00.0 On | Off |
| 0% 56C P8 39W / 450W | 996MiB / 24564MiB | 19% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA Graphics... On | 00000000:2D:00.0 Off | Off |
| 0% 39C P8 29W / 450W | 6MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA Graphics... On | 00000000:2E:00.0 Off | Off |
| 0% 34C P8 24W / 450W | 6MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Reproduction
import torch
import requests
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda:2")
pipe.enable_xformers_memory_efficient_attention()

prompt = "one tigers"
n_prompt = "artstation"
image = pipe(prompt=prompt, num_inference_steps=20, negative_prompt=n_prompt, guidance_scale=7).images[0]
image.save("a.jpg")

prompt = "two tigers"
n_prompt = "artstation"
image = pipe(prompt=prompt, num_inference_steps=20, negative_prompt=n_prompt, guidance_scale=7).images[0]
image.save("b.jpg")

prompt = "three tigers"
n_prompt = "artstation"
image = pipe(prompt=prompt, num_inference_steps=20, negative_prompt=n_prompt, guidance_scale=7).images[0]
image.save("c.jpg")
Logs
No response
System Info
Ubuntu 22.04
Python 3.10.6
diffusers 0.10.2
Torch 1.13.0+cu117
xformers 0.0.15.dev0+c101579.d20221128
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
CUDA Version: 11.8
I have the same problem, and I have only one GPU: the SD 2.0 768-v model produces wrong results with xformers, while SD 2.0-base (512) is normal.
Perhaps try instantiating a DDIM scheduler explicitly and passing it in to the pipeline.from_pretrained() call?
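Something like this, I think (untested sketch; the subfolder argument assumes the standard layout of the Hub repo):

import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

model_id = "stabilityai/stable-diffusion-2-1"

# Build the DDIM scheduler from the model's own scheduler config and
# pass it to the pipeline explicitly.
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to("cuda")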
@lawfordp2017 Same problem.
Hmm, that's interesting. @pcuenca could you maybe check whether it works for you on a 3090?
@linyu0219 what GPU did you use, a 4090 or something else? Also, it would be nice if you could share the code snippet.
Works fine for me on 3090 (cuda:0) and 2080 Ti (cuda:1). I don't currently have any 4090s to test.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.