PNDMScheduler has a considerable time cost when running StableDiffusionPipeline in diffusers 0.4.1

Describe the bug

Environment

GPU: A10, CUDA: 11.6, cuDNN: 8.4.0, Torch: 1.12.1, diffusers: 0.4.1

Phenomenon

When I ran StableDiffusionPipeline with fp16 precision, I found that the time spent in PNDMScheduler increased a lot after I upgraded diffusers to 0.4.1. It takes about 4.2 seconds, while the UNet takes 6.6 seconds. In diffusers 0.3.0, the time spent in PNDMScheduler was negligible. I wonder what changed in the upgrade.

Profile

Code

from diffusers import StableDiffusionPipeline
import time
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
start = time.time()
image = pipe(prompt, num_inference_steps=100).images[0]
time_cost = time.time() - start
image.save("astronaut_rides_horse.png")
print(f"Image saved! Total time cost: {time_cost:2f} s")
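
The script above only reports total wall-clock time. One way the per-component split (scheduler vs. UNet) could be collected is sketched below. `time_method` is a hypothetical helper, not part of diffusers, and the usage in the trailing comment assumes the pipeline calls `pipe.scheduler.step` and `pipe.unet.forward` on each denoising step. CUDA kernels launch asynchronously, so a timer without synchronization can bill pending UNet work to whichever later call happens to block. The optional `sync` callback is there to rule that out.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def time_method(obj, method_name, totals, label, sync=None):
    """Temporarily wrap obj.<method_name> and add its wall-clock time to totals[label].

    sync is an optional zero-argument callable (e.g. torch.cuda.synchronize)
    invoked before and after each call so queued GPU work is not billed
    to the wrong component.
    """
    original = getattr(obj, method_name)

    def timed(*args, **kwargs):
        if sync is not None:
            sync()  # finish work queued by earlier calls before starting the clock
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            if sync is not None:
                sync()  # wait for this call's own GPU work before stopping the clock
            totals[label] += time.perf_counter() - start

    setattr(obj, method_name, timed)
    try:
        yield
    finally:
        # Put the original method back even if the timed region raised.
        setattr(obj, method_name, original)


# Hypothetical usage with the pipeline from this issue (needs a CUDA GPU):
#   totals = defaultdict(float)
#   with time_method(pipe.scheduler, "step", totals, "scheduler", torch.cuda.synchronize), \
#        time_method(pipe.unet, "forward", totals, "unet", torch.cuda.synchronize):
#       pipe(prompt, num_inference_steps=100)
#   print(dict(totals))
```

Comparing the totals with and without `sync` may help tell apart a scheduler that is genuinely slow from one that is merely the first place where the CPU waits for the GPU.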

Reproduction

No response

Logs

No response

System Info

diffusers version: 0.4.1
Platform: Linux-4.14.0_1-0-0-43-x86_64-with-centos-7.9.2009-Core
Python version: 3.7.0
PyTorch version (GPU?): 1.12.1+cu116 (True)
Huggingface_hub version: 0.10.0
Transformers version: 4.22.2
Using GPU in script?:
Using distributed or parallel set-up in script?:

Oct 09 '22 03:10 joey12300

Hey @joey12300,

I cannot really confirm this. I'm running the following script:

#!/usr/bin/env python3
from diffusers import StableDiffusionPipeline
import diffusers
import time
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
start = time.time()

if diffusers.__version__ == "0.3.0":
    with torch.autocast("cuda"):
        image = pipe(prompt, num_inference_steps=100).images[0]
else:
    image = pipe(prompt, num_inference_steps=100).images[0]

time_cost = time.time() - start
image.save("astronaut_rides_horse.png")
print(f"Image saved! Total time cost: {time_cost:2f} s")

Note that in version 0.3.0 it was not yet possible to run fp16 without autocast. I'm getting the following result for 0.3.0:

Image saved! Total time cost: 10.437387 s

and for 0.4.1:

Image saved! Total time cost: 8.903677 s

This shows that 0.4.1 is significantly faster. I'm using an A100 GPU.

Oct 09 '22 16:10 patrickvonplaten

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Nov 08 '22 15:11 github-actions[bot]