
Performance Issue with RTX 4090 and all SD/Diffusers versions

Open Marcophono2 opened this issue 3 years ago · 37 comments

Describe the bug

Hello!

For the past 10 days, nearly around the clock, I have been trying to get my brand new and proudly owned GeForce RTX 4090 graphics card to work properly with Stable Diffusion. But now, 10 days later, it is still performing around 50% below its potential.

In those 240 hours I switched from Ubuntu to Manjaro (and from Manjaro back to Ubuntu, via Pop OS back to Ubuntu again, and on to Manjaro Nightly, which contains all the Nvidia support, more or less working). Ubuntu absolutely refused to pair 22.10, 22.04, or 20.04 with my AMD hardware

Threadripper Pro 3955WX ASUS PRO WS WRX80E-SAGE

and my graphics card, the RTX 4090.

Yes, really, it was not the 4090. It was the mainboard and the CPU that caused the big trouble since one of the newer Ubuntu releases. Or, the other way round: Ubuntu is the (damn) troublemaker. After about 50 reinstallations I replaced the 4090 with a GeForce 2070, started from scratch, and found myself again (and again) in the same position: yelling and cursing! Still the same issues.

Meanwhile, yes, better now, I managed to get Manjaro working with CUDA 11.8, Nvidia driver version 520.56.06, and CUDA compilation tools 11.8, V11.8.89 (build cuda_11.8.r11.8/compiler.31833905_0),

and used the nightly PyTorch version 1.13.

Benchmark results:

with RTX 3090 (512x512), fp16, standard: 12.7 it/s

with RTX 3090 (512x512), fp16, prepared unet optimization: 14.9 it/s

with RTX 4090 (512x512), fp16, with or without optimization: 11.5 it/s
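
These numbers come from step timings of roughly the following kind — a minimal sketch, assuming the runwayml/stable-diffusion-v1-5 checkpoint that appears later in this thread; prompt and step count are illustrative:

import time
import torch
from diffusers import StableDiffusionPipeline

# Load SD in fp16 on the GPU; 512x512, no xFormers, matching the setups above.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

steps = 50
pipe("warmup", num_inference_steps=5)  # exclude one-time CUDA/cuDNN init cost
torch.cuda.synchronize()
start = time.perf_counter()
pipe("a small cat", num_inference_steps=steps, height=512, width=512)
torch.cuda.synchronize()
print(f"{steps / (time.perf_counter() - start):.1f} it/s")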

So, at the end (of my long issue description), the question remains: why is SD, with all this software and driver support, so much weaker on an RTX 4090 than on an RTX 3090?

I know that xFormers has lately been delivering an impressive performance boost. But I excluded xFormers from my benchmarks.

Can anyone help me? I am frustrated by now. If someone can help me fix this missing link between SD, Nvidia support, PyTorch, and my hardware, I will be generous.

Best regards Marc

Reproduction

No response

Logs

No response

System Info

Manjaro Nightly

Threadripper Pro 3955WX ASUS PRO WS WRX80E-SAGE

CUDA 11.8, Nvidia Driver Version: 520.56.06, Cuda compilation tools 11.8, V11.8.89, (build cuda_11.8.r11.8/compiler.31833905_0)

nightly PyTorch version 1.13

Marcophono2 avatar Oct 23 '22 01:10 Marcophono2

Uff, I don't really know the details of GPU hardware well enough here, sadly. @NouamaneTazi do you have a hunch maybe? :-)

patrickvonplaten avatar Oct 25 '22 11:10 patrickvonplaten

@NouamaneTazi , do you have a hunch? :)

Marcophono2 avatar Oct 27 '22 18:10 Marcophono2

I have never worked with PyTorch 1.13 or CUDA 11.8 before. Do the 3 benchmarks use the same environment @Marcophono2?

NouamaneTazi avatar Oct 27 '22 21:10 NouamaneTazi

Yes, they did, @NouamaneTazi . Meanwhile I found out that the RP https://github.com/AUTOMATIC1111/stable-diffusion-webui offers a technical support for the 4090. I can produce a 640x640px image with 22it/s. I used this unusual size because with 512x512 my GPU utilization is only 60%. There is no other bottleneck. So I could enlarge the size until the utilization is nearly 100% without losing speed. So, the performance is great but this is a Windows solution only and at the moment only with a web UI. Not so good for command line based processing with multi GPU platform.
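
A minimal sketch of the resolution sweep described above, under the same assumptions as the earlier benchmark snippet (sizes must stay divisible by 8 for SD 1.x):

import time
import torch
from diffusers import StableDiffusionPipeline

# Grow the resolution and watch it/s to find the point where the GPU
# stops being underutilized.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for size in (512, 576, 640, 704):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe("a small cat", num_inference_steps=30, height=size, width=size)
    torch.cuda.synchronize()
    print(size, f"{30 / (time.perf_counter() - start):.1f} it/s")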

Marcophono2 avatar Oct 27 '22 21:10 Marcophono2

I'm afraid I can't help much myself, as I don't have access to any Lovelace GPU. It seems that updating cuDNN should help speed up inference. I would recommend you follow these threads: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2449 and https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/2537. The same should apply to diffusers.

NouamaneTazi avatar Oct 27 '22 22:10 NouamaneTazi

@NouamaneTazi, yes, I know those threads and followed every detail. But in the end I was not able to build it in the same, or a similar, way for Linux. Oh, no problem that you don't have access to a Lovelace GPU. Feel free to use my vast.ai account and book a 4090 instance there. Right at this moment I am using one. My billing account there is well funded, so you can use a 4090 instance round the clock for days. Just send me a message if you are interested and I will send you my login data.

Marcophono2 avatar Oct 27 '22 22:10 Marcophono2

@NouamaneTazi I lost so much time trying to find a solution that it would be a big pleasure for me to pay you for your work if you are successful. (But I need it by Monday at the latest, sorry. :-)

Marcophono2 avatar Oct 27 '22 22:10 Marcophono2

@Marcophono2 To build xformers for Lovelace, you need to modify torch/utils/cpp_extension.py to include CUDA arch "8.9"

PyTorch 1.13 regressed performance on my machine, so you may be losing performance there.

C43H66N12O12S2 avatar Oct 29 '22 09:10 C43H66N12O12S2

@C43H66N12O12S2 Interesting! But I think more is necessary than adding

    ('Lovelace', '8.9+PTX'),

and

supported_arches = ['3.5', '3.7', '5.0', '5.2', '5.3', '6.0', '6.1', '6.2',
                    '7.0', '7.2', '7.5', '8.0', '8.6', '8.9']

:) But what exactly? I installed everything again and am staying with PyTorch 1.12 now, as you recommended.

Marcophono2 avatar Oct 30 '22 17:10 Marcophono2

After you modify the file, set the TORCH_CUDA_ARCH_LIST="8.9" env variable.

This is how I compile my Windows wheels.
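
For readers following along, here is a sketch of what the two edits look like in context, based on the PyTorch 1.12/1.13-era _get_cuda_arch_flags() in torch/utils/cpp_extension.py; surrounding entries are abbreviated and may differ slightly between versions:

import collections

# Inside _get_cuda_arch_flags() in torch/utils/cpp_extension.py:
named_arches = collections.OrderedDict([
    # ... existing entries, e.g. ('Ampere', '8.0;8.6+PTX') ...
    ('Lovelace', '8.9+PTX'),  # added so PyTorch recognizes the arch name
])

supported_arches = ['3.5', '3.7', '5.0', '5.2', '5.3', '6.0', '6.1', '6.2',
                    '7.0', '7.2', '7.5', '8.0', '8.6', '8.9']  # '8.9' appended

# Then build with the env var set, e.g.:
#   TORCH_CUDA_ARCH_LIST="8.9" pip wheel -e .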

C43H66N12O12S2 avatar Oct 30 '22 17:10 C43H66N12O12S2

@C43H66N12O12S2 Okay. But shouldn't it be the other way round? PyTorch must be recompiled with that added environment parameter, is that correct? But then the cpp_extension.py file will be overwritten.

Marcophono2 avatar Oct 30 '22 18:10 Marcophono2

@C43H66N12O12S2 And I also think I need CUDA 11.8 for the SD project then, or not?

Marcophono2 avatar Oct 30 '22 18:10 Marcophono2

No, PyTorch (and the official releases) is fine. Modifying cpp_extension.py is necessary because PyTorch hard-blocks any CUDA arch not on its list.

You need the 11.8 nvcc to compile for CUDA arch 8.9, yes. Not for inference.
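
A quick way to sanity-check both sides of this, assuming a working CUDA install: ask PyTorch which SASS/PTX targets its kernels were compiled for, and what the driver reports for the card. torch.cuda.get_arch_list() and torch.cuda.get_device_capability() are standard PyTorch APIs:

import torch

# Which architectures does this PyTorch build ship kernels for?
print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("compiled arch list:", torch.cuda.get_arch_list())  # e.g. ['sm_37', ..., 'sm_86']

# What does the installed GPU actually report? A 4090 should print (8, 9).
print("device capability:", torch.cuda.get_device_capability(0))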

C43H66N12O12S2 avatar Oct 30 '22 18:10 C43H66N12O12S2

@C43H66N12O12S2 Okay. That sounds easier than expected. :) Can you tell me where in the environment I have to add it? Or how to add it to a command?

Marcophono2 avatar Oct 30 '22 18:10 Marcophono2

In Linux, TORCH_CUDA_ARCH_LIST="8.9" pip wheel -e . inside the cloned xformers repo should work.

C43H66N12O12S2 avatar Oct 30 '22 18:10 C43H66N12O12S2

Great, thanks a lot, @C43H66N12O12S2! I have a good feeling that this will take me a big step forward! :+1:

Marcophono2 avatar Oct 30 '22 18:10 Marcophono2

@C43H66N12O12S2 I was too optimistic. I think I did everything correctly (not really sure, of course), but I get a large error output.

My setup:

I installed the branch from @MatthieuTPHR -> https://github.com/MatthieuTPHR/diffusers/archive/refs/heads/memory_efficient_attention.zip

and I installed xformers as you mentioned. If I then run a little test program

import torch
from diffusers import StableDiffusionPipeline

# SD 1.5 in fp16 on the GPU (auth token redacted)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", revision="fp16", torch_dtype=torch.float16, use_auth_token="hf_LFWSneVmdLYPKbkIRpCrCKxxx",
).to("cuda")

with torch.inference_mode(), torch.autocast("cuda"):
    image = pipe("a small cat")  # the output object carries the PIL image in .images[0]

with

USE_MEMORY_EFFICIENT_ATTENTION=1 python test.py

I receive the following long error text. The attention.py module correctly detects that xformers is present. Any ideas what could be wrong? If I just run

python test.py

the image is created, but at less than 10 it/s. A bit weak for a 4090. I also noticed that my GPU memory is always around 22-23 GB occupied and utilization is at 99%.


~/Schreibtisch/AI $ USE_MEMORY_EFFICIENT_ATTENTION=1 python test.py
xformers is present
Downloading: 100%|██████████| 543/543 [00:00<00:00, 815kB/s]
Downloading: 100%|██████████| 342/342 [00:00<00:00, 512kB/s]
Downloading: 100%|██████████| 4.63k/4.63k [00:00<00:00, 6.35MB/s]
Downloading: 100%|██████████| 608M/608M [00:06<00:00, 100MB/s]
Downloading: 100%|██████████| 209/209 [00:00<00:00, 314kB/s]
Downloading: 100%|██████████| 209/209 [00:00<00:00, 297kB/s]
Downloading: 100%|██████████| 572/572 [00:00<00:00, 880kB/s]
Downloading: 100%|██████████| 246M/246M [00:03<00:00, 77.8MB/s]
Downloading: 100%|██████████| 525k/525k [00:00<00:00, 1.26MB/s]
Downloading: 100%|██████████| 472/472 [00:00<00:00, 719kB/s]
Downloading: 100%|██████████| 788/788 [00:00<00:00, 1.19MB/s]
Downloading: 100%|██████████| 1.06M/1.06M [00:00<00:00, 1.75MB/s]
Downloading: 100%|██████████| 772/772 [00:00<00:00, 1.16MB/s]
Downloading: 100%|██████████| 1.72G/1.72G [00:15<00:00, 114MB/s]
Downloading: 100%|██████████| 550/550 [00:00<00:00, 834kB/s]
Downloading: 100%|██████████| 167M/167M [00:02<00:00, 70.8MB/s]
Fetching 16 files: 100%|██████████| 16/16 [00:50<00:00,  3.13s/it]
  0%|                                                                                                                     | 0/51 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/marc/Schreibtisch/AI/test.py", line 10, in <module>
    image = pipe("a small cat")
  File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 326, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
  File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/unet_2d_condition.py", line 296, in forward
    sample, res_samples = downsample_block(
  File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/unet_2d_blocks.py", line 563, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/attention.py", line 187, in forward
    hidden_states = block(hidden_states, context=context)
  File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/attention.py", line 236, in forward
    hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
  File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/marc/Schreibtisch/AI/diffusers-memory_efficient_attention/src/diffusers/models/attention.py", line 275, in forward
    out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op)
  File "/home/marc/Schreibtisch/AI/xformers/xformers/ops.py", line 862, in memory_efficient_attention
    return op.forward_no_grad(
  File "/home/marc/Schreibtisch/AI/xformers/xformers/ops.py", line 305, in forward_no_grad
    return cls.FORWARD_OPERATOR(
  File "/home/anaconda3/envs/SD/lib/python3.9/site-packages/torch/_ops.py", line 143, in __call__
    return self._op(*args, **kwargs or {})
NotImplementedError: Could not run 'xformers::efficient_attention_forward_cutlass' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'xformers::efficient_attention_forward_cutlass' is only available for these backends: [UNKNOWN_TENSOR_TYPE_ID, QuantizedXPU, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseCPU, SparseCUDA, SparseHIP, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseVE, UNKNOWN_TENSOR_TYPE_ID, NestedTensorCUDA, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID].

BackendSelect: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/PythonFallbackKernel.cpp:133 [backend fallback]
Named: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/ConjugateFallback.cpp:18 [backend fallback]
Negative: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]
AutogradOther: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:35 [backend fallback]
AutogradCPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:39 [backend fallback]
AutogradCUDA: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:47 [backend fallback]
AutogradXLA: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:51 [backend fallback]
AutogradMPS: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:59 [backend fallback]
AutogradXPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:43 [backend fallback]
AutogradHPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:68 [backend fallback]
AutogradLazy: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/VariableFallbackKernel.cpp:55 [backend fallback]
Tracer: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/torch/csrc/autograd/TraceTypeManual.cpp:295 [backend fallback]
AutocastCPU: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/autocast_mode.cpp:481 [backend fallback]
Autocast: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/autocast_mode.cpp:324 [backend fallback]
Batched: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/BatchingRegistrations.cpp:1064 [backend fallback]
VmapMode: fallthrough registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
Functionalize: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/FunctionalizeFallbackKernel.cpp:89 [backend fallback]
PythonTLSSnapshot: registered at /opt/conda/conda-bld/pytorch_1659484806139/work/aten/src/ATen/core/PythonFallbackKernel.cpp:137 [backend fallback]

Marcophono2 avatar Oct 30 '22 23:10 Marcophono2

It looks like you made some errors while compiling, and the resulting xformers lacks any SASS code for 8.9.

As for performance issues with the 4090, you could try following my advice inside the thread posted earlier by Nouamane.
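
One hedged way to check that diagnosis, assuming the CUDA 11.8 toolkit's cuobjdump is on PATH and the built extension sits next to the xformers package as _C*.so: dump the embedded ELF/SASS targets and look for sm_89.

import pathlib
import subprocess
import xformers

# Locate the compiled extension and list the SASS architectures embedded in
# it; a Lovelace-capable build should show sm_89 entries.
# (next() raises StopIteration if no compiled extension is found.)
so = next(pathlib.Path(xformers.__file__).parent.glob("_C*.so"))
out = subprocess.run(["cuobjdump", "--list-elf", str(so)],
                     capture_output=True, text=True)
print(out.stdout or out.stderr)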

C43H66N12O12S2 avatar Oct 30 '22 23:10 C43H66N12O12S2

To be honest, I do not know where I went wrong. I would really be happy if you could spot something going wrong:

~ $ nvidia-smi
Mon Oct 31 01:14:36 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0  On |                  Off |
|  0%   39C    P8    35W / 450W |    447MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       705      G   /usr/lib/Xorg                     210MiB |
|    0   N/A  N/A       873      G   /usr/bin/kwin_x11                  46MiB |
|    0   N/A  N/A       892      G   /usr/bin/plasmashell               57MiB |
|    0   N/A  N/A      1347      G   /usr/lib/firefox/firefox          126MiB |
+-----------------------------------------------------------------------------+
~ $ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
~/Schreibtisch/AI/xformers (main) $ TORCH_CUDA_ARCH_LIST="8.9" pip wheel -e .
Obtaining file:///home/marc/Schreibtisch/AI/xformers
  Preparing metadata (setup.py) ... done
Collecting torch>=1.12
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/torch-1.13.0-cp39-cp39-manylinux1_x86_64.whl
Collecting numpy
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/numpy-1.23.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Collecting pyre-extensions==0.0.23
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/pyre_extensions-0.0.23-py3-none-any.whl
Collecting typing-extensions
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/typing_extensions-4.4.0-py3-none-any.whl
Collecting typing-inspect
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/typing_inspect-0.8.0-py3-none-any.whl
Collecting nvidia-cudnn-cu11==8.5.0.96
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl
Collecting nvidia-cublas-cu11==11.10.3.66
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl
Collecting nvidia-cuda-runtime-cu11==11.7.99
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl
Collecting wheel
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/wheel-0.37.1-py2.py3-none-any.whl
Collecting setuptools
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/setuptools-65.5.0-py3-none-any.whl
Collecting mypy-extensions>=0.3.0
  File was already downloaded /home/marc/Schreibtisch/AI/xformers/mypy_extensions-0.4.3-py2.py3-none-any.whl
Building wheels for collected packages: xformers
  Building wheel for xformers (setup.py) ... done
  Created wheel for xformers: filename=xformers-0.0.14.dev0-cp39-cp39-linux_x86_64.whl size=34465759 sha256=6b285b6d9a37c887a8154cc1f00f7291e13dc6eb9b926c8bca7b64cc62607eca
  Stored in directory: /tmp/pip-ephem-wheel-cache-gnqnuxhl/wheels/f6/c7/73/63c154ea45fb20e7eec4f956dfb9c91be386a33afb31b7c359
Successfully built xformers

Yes, I will check again what @NouamaneTazi suggested. I had thought that this way was not an option because there are some Windows DLLs involved.

Marcophono2 avatar Oct 31 '22 01:10 Marcophono2

@C43H66N12O12S2, @NouamaneTazi DAMN!! Just replacing the cuDNN stuff in the torch lib directory brought a 100% speed-up mega punch!!! From 9.5 to 17.5 it/s. You made an old man happy and smiling for the first time in two weeks!! I had already used this great trick successfully in the AUTOMATIC1111 webUI version but thought, for whatever reason, that this wasn't possible on Linux. Now I must implement that xFormers thing and .... YEEEEAH!! :-D :-D
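
For anyone reproducing this on Linux: a hedged sketch of what "replacing the cuDNN stuff" amounts to, i.e. copying newer cuDNN 8.x shared libraries (downloaded separately from NVIDIA) over the ones bundled inside torch/lib. The archive path below is an example, not gospel; back up the originals first.

import glob
import pathlib
import shutil
import site

# Example source: an unpacked cuDNN 8.6 archive from NVIDIA (path illustrative).
cudnn_src = pathlib.Path.home() / "Downloads/cudnn-linux-x86_64-8.6.0.163_cuda11-archive/lib"
torch_lib = pathlib.Path(site.getsitepackages()[0]) / "torch/lib"

for lib in glob.glob(str(cudnn_src / "libcudnn*.so*")):
    shutil.copy2(lib, torch_lib)  # overwrites the bundled cuDNN libraries
    print("copied", pathlib.Path(lib).name)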

Marcophono2 avatar Oct 31 '22 01:10 Marcophono2

@C43H66N12O12S2, @NouamaneTazi Now at 25 it/s! :-))) Still no xFormers. I only rebuilt the unet weights and added them to the pipeline. flax is similarly fast, by the way.

Marcophono2 avatar Oct 31 '22 02:10 Marcophono2

You can try the env variable without quotes, like this: TORCH_CUDA_ARCH_LIST=8.9 pip wheel -e .

If that fails as well, I have no idea.

C43H66N12O12S2 avatar Oct 31 '22 13:10 C43H66N12O12S2

@C43H66N12O12S2 No, it did still not work. But after a new setup I am at 28 it/s including Euler_a. Probably the PyTorch nightly (1.14) gave an extra punch. Meanwhile I am not sure whether xFormers would really give still more improvement!? Is xFormers independent of the unet? Or is it, from a technical point of view, kind of another "version" of the unet implementation?

P.S.: Is this only a subjective impression of mine, or does Euler (Euler_a) really produce significantly better results? I only create images with photorealistic scenes, so I cannot compare this scheduler with others in other disciplines like painting or digital art.
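
For reference, swapping schedulers in diffusers is a one-liner; a minimal sketch using EulerAncestralDiscreteScheduler, which diffusers added around 0.7 (this timeframe), reusing the pipeline's existing scheduler config:

import torch
from diffusers import EulerAncestralDiscreteScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Replace the default scheduler with Euler Ancestral.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe("a photorealistic small cat", num_inference_steps=30).images[0]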

Marcophono2 avatar Nov 02 '22 14:11 Marcophono2

@Marcophono2 Can you give a breakdown of what you had to do to get this working for people lacking too much sleep?

But after a new setup I am at 28 it/s including Euler_a. Probably the PyTorch nightly (1.14) gave an extra punch.

XodrocSO avatar Nov 05 '22 06:11 XodrocSO

Sure, @XodrocSO. Aside from the fact that this repo meanwhile supports Euler too, I can simply tell you how I increased the performance of my 4090 (which is now a bit over 30 it/s, for whatever reason, and 19.5 it/s on a 3090). The most important thing is to update the cuDNN files. I must first search again for the description and the direct download link, so let me ask you first: are you talking about 4090 support under Linux? With anything other than a 4090 there is no need to update the cuDNN files. If you are on Windows with a 4090 you can also update the cuDNN files, but in that case the files are different ones.
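
A quick hedged check that the swapped-in libraries are actually the ones PyTorch loads; torch.backends.cudnn.version() is a standard API:

import torch

# After replacing the cuDNN files, this should report the newer build
# (e.g. 8600 for cuDNN 8.6 instead of the bundled 8500).
print("cuDNN version PyTorch sees:", torch.backends.cudnn.version())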

Marcophono2 avatar Nov 05 '22 12:11 Marcophono2

@Marcophono2 Windows and 4090 basically, thanks!

XodrocSO avatar Nov 05 '22 20:11 XodrocSO

So sorry for the delay, @XodrocSO. In the night before yesterday I was clever enough to crash my Linux system with a totally useless and risky installation of a classifier into my SD environment, which overwrote a lot of packages and dependencies, so that my wonderfully optimized SD dropped from >30 it/s to 1.5 it/s. On a 4090. OH-MY-GOD! And of course I had no backup. Okay, a truly helpful backup would have been a complete partition mirror. But I thought in the worst case I could simply repeat the steps I had successfully taken. Wrong! Obviously I had forgotten a lot of things: matching versions, the order of the installation steps, and when to install via conda and when via pip. I am on Manjaro, so there are not many guides I could consult; the Ubuntu setup does not work here. And whenever I did some Google research I kept finding my own happy post about my success, which I had destroyed in a moment of brainlessness. Anyway, after 18 hours I was able to set it up again. And of course I have documented every step now. :-)

But that's not the point. You are on Windows, so it is easier, because there are some good step-by-step how-tos to be found in the SD webui repo. Yes, they are not for this SD here (without a GUI), but I am sure they will do the necessary setup so that you can also use this SD after those steps. The point is that PyTorch still does not support Lovelace (4090/4080) in the default setup. The wheel built by @C43H66N12O12S2 is a really wonderful help for Windows and injects the cuDNN libs into PyTorch. Please check the description from @sigglypuff: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4316#issuecomment-1304612278 Just note that he corrected one step in a follow-up post, so don't go through it too chronologically.

Marcophono2 avatar Nov 07 '22 06:11 Marcophono2

@Marcophono2 You say you're using Manjaro. Gnome version? If so, IIRC that uses zsh by default. Maybe that's why my command didn't work.

Try launching that command from bash, like this: bash -c 'TORCH_CUDA_ARCH_LIST=8.9 pip wheel -e .'

C43H66N12O12S2 avatar Nov 07 '22 14:11 C43H66N12O12S2

Interesting point, @C43H66N12O12S2. I have the KDE edition. I tried what you wrote but got this error output:

[...]   
    nvcc fatal   : Failed to preprocess host compiler properties.
    [5/5] c++ -MMD -MF /home/marc/Schreibtisch/AI/xformers/build/temp.linux-x86_64-cpython-310/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.o.d -pthread -B /home/anaconda3/envs/MII/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/anaconda3/envs/MII/include -fPIC -O2 -isystem /home/anaconda3/envs/MII/include -fPIC -I/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn -I/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src -I/home/marc/Schreibtisch/AI/xformers/third_party/cutlass/include -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include/TH -I/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/include/THC -I/opt/cuda/include -I/home/anaconda3/envs/MII/include/python3.10 -c -c /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp -o /home/marc/Schreibtisch/AI/xformers/build/temp.linux-x86_64-cpython-310/home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.o -O3 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C_flashattention -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
    In file included from /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha.h:41,
                     from /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:32:
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h: In function 'void set_alpha(uint32_t&, float, Data_type)':
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h:63:53: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
       63 |         alpha = reinterpret_cast<const uint32_t &>( h2 );
          |                                                     ^~
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h:68:53: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
       68 |         alpha = reinterpret_cast<const uint32_t &>( h2 );
          |                                                     ^~
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha_utils.h:70:53: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
       70 |         alpha = reinterpret_cast<const uint32_t &>( norm );
          |                                                     ^~~~
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp: In function 'void set_params_fprop(FMHA_fprop_params&, size_t, size_t, size_t, size_t, size_t, at::Tensor, at::Tensor, at::Tensor, void*, void*, void*, void*, void*, void*, float, float, bool)':
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:62:11: warning: 'void* memset(void*, int, size_t)' clearing an object of non-trivial type 'struct FMHA_fprop_params'; use assignment or value-initialization instead [-Wclass-memaccess]
       62 |     memset(&params, 0, sizeof(params));
          |     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/src/fmha.h:74:8: note: 'struct FMHA_fprop_params' declared here
       74 | struct FMHA_fprop_params : public Qkv_params {
          |        ^~~~~~~~~~~~~~~~~
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:58:15: warning: unused variable 'acc_type' [-Wunused-variable]
       58 |     Data_type acc_type = DATA_TYPE_FP32;
          |               ^~~~~~~~
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp: In function 'std::vector<at::Tensor> mha_bwd_block(const at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, const at::Tensor&, const at::Tensor&, const at::Tensor&, int, int, float, float, bool, c10::optional<at::Generator>)':
    /home/marc/Schreibtisch/AI/xformers/third_party/flash-attention/csrc/flash_attn/fmha_api.cpp:597:10: warning: unused variable 'is_sm8x' [-Wunused-variable]
      597 |     bool is_sm8x = dprops->major == 8 && dprops->minor >= 0;
          |          ^~~~~~~
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1901, in _run_ninja_build
        subprocess.run(
      File "/home/anaconda3/envs/MII/lib/python3.10/subprocess.py", line 524, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/home/marc/Schreibtisch/AI/xformers/setup.py", line 251, in <module>
        setuptools.setup(
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
        return distutils.core.setup(**attrs)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
        return run_commands(dist)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
        dist.run_commands()
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
        self.run_command(cmd)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
        super().run_command(command)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
        cmd_obj.run()
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 299, in run
        self.run_command('build')
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
        self.distribution.run_command(command)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
        super().run_command(command)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
        cmd_obj.run()
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 132, in run
        self.run_command(cmd_name)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
        self.distribution.run_command(command)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
        super().run_command(command)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
        cmd_obj.run()
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
        _build_ext.run(self)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
        self.build_extensions()
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
        build_ext.build_extensions(self)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
        self._build_extensions_serial()
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
        self.build_extension(ext)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
        _build_ext.build_extension(self, ext)
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
        objects = self.compiler.compile(
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
        _write_ninja_file_and_compile_objects(
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1573, in _write_ninja_file_and_compile_objects
        _run_ninja_build(
      File "/home/anaconda3/envs/MII/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1917, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension
    [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for xformers
Running setup.py clean for xformers
Failed to build xformers
ERROR: Failed to build one or more wheels

Marcophono2 avatar Nov 07 '22 15:11 Marcophono2

Ouch @Marcophono2, sounds like quite the headache! Thanks for the info!

XodrocSO avatar Nov 07 '22 23:11 XodrocSO