A strange time cost in denoising loop
Describe the bug
I found a strange time cost when running inference with StableVideoDiffusionPipeline on GPU. On a 3090, the UNet only costs 1.6s, yet a single denoising loop iteration costs 3.3s.
This is strange, so I analysed the time cost of all parts of the denoising loop with Viztracer (a tool for tracing and visualizing execution time), and found a strange blank in the denoising loop.
The green forward belongs to the UNet, the pink step belongs to the scheduler, and __call__ belongs to the pipeline.
Timing each part, I get: before unet: 0.0006058, unet time: 1.5902, cfg time: 1.6592, step time: 0.0005975, loop time: 3.2510 for every loop iteration. I do not think classifier-free guidance should cost that much! If I only remove the if self.do_classifier_free_guidance: guard, I get before unet: 0.0005329, unet time: 1.6008, cfg time: 0.001716, step time: 0.02834 for the first iteration, and before unet: 1.6479, unet time: 1.6055, cfg time: 0.001713, step time: 0.02958 for the other iterations.
If I run inference on CPU, the numbers look right.
The brown forward belongs to the UNet; __call__ belongs to the pipeline.
before unet time: 0.002647, unet time: 394.4368, cfg time: 0.0008144, step time: 0.008308
I also tested DiTPipeline; the behavior is the same as with StableVideoDiffusionPipeline.
The green forward belongs to the DiT; __call__ belongs to the pipeline.
I'm really curious about what happens during that time. Thanks a lot for the help!
Reproduction
I was simply using StableVideoDiffusionPipeline.
import torch
from diffusers import StableVideoDiffusionPipeline

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, variant="fp16"
)
pipeline.to("cuda")

frames = pipeline(
    image,
    decode_chunk_size=5,
    num_frames=25,
    num_inference_steps=5,  # denoising steps
    min_guidance_scale=1.0,
    max_guidance_scale=3.0,
    fps=7,
).frames[0]
Tracing time usage
with self.progress_bar(total=num_inference_steps) as progress_bar:
    for i, t in enumerate(timesteps):
        s_t = time.time()
        # expand the latents if we are doing classifier free guidance
        latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

        # Concatenate image_latents over channels dimension
        latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)
        prepare_t = time.time()

        # predict the noise residual
        noise_pred = self.unet(
            latent_model_input,
            t,
            encoder_hidden_states=image_embeddings,
            added_time_ids=added_time_ids,
            return_dict=False,
        )[0]
        print(noise_pred.shape)
        unet_t = time.time()

        # perform guidance
        if self.do_classifier_free_guidance:
            noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_cond - noise_pred_uncond)
        cfg_t = time.time()

        # compute the previous noisy sample x_t -> x_t-1
        latents = self.scheduler.step(noise_pred, t, latents).prev_sample
        step_t = time.time()

        if callback_on_step_end is not None:
            callback_kwargs = {}
            for k in callback_on_step_end_tensor_inputs:
                callback_kwargs[k] = locals()[k]
            callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
            latents = callback_outputs.pop("latents", latents)

        if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
            progress_bar.update()

        print(f'prepare time: {prepare_t - s_t}, unet time: {unet_t - prepare_t}, cfg time: {cfg_t - unet_t}, step time: {step_t - cfg_t}')
        end_time = time.time()
        print('loop time: ', end_time - s_t)
Removing if self.do_classifier_free_guidance:
with self.progress_bar(total=num_inference_steps) as progress_bar:
    for i, t in enumerate(timesteps):
        s_t = time.time()
        # expand the latents if we are doing classifier free guidance
        latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

        # Concatenate image_latents over channels dimension
        latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)
        prepare_t = time.time()

        # predict the noise residual
        noise_pred = self.unet(
            latent_model_input,
            t,
            encoder_hidden_states=image_embeddings,
            added_time_ids=added_time_ids,
            return_dict=False,
        )[0]
        print(noise_pred.shape)
        unet_t = time.time()

        # perform guidance
        # if self.do_classifier_free_guidance:
        noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_cond - noise_pred_uncond)
        cfg_t = time.time()

        # compute the previous noisy sample x_t -> x_t-1
        latents = self.scheduler.step(noise_pred, t, latents).prev_sample
        step_t = time.time()

        if callback_on_step_end is not None:
            callback_kwargs = {}
            for k in callback_on_step_end_tensor_inputs:
                callback_kwargs[k] = locals()[k]
            callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
            latents = callback_outputs.pop("latents", latents)

        if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
            progress_bar.update()

        print(f'prepare time: {prepare_t - s_t}, unet time: {unet_t - prepare_t}, cfg time: {cfg_t - unet_t}, step time: {step_t - cfg_t}')
        end_time = time.time()
        print('loop time: ', end_time - s_t)
Logs
No response
System Info
- Platform: Linux worker6 6.5.0-21-generic # 21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 9 13:32:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- GPU: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
- CPU: 20 Intel(R) Xeon(R) W-2150B CPU @ 3.00GHz
- diffusers version: 0.27.0
- Python version: 3.10.13
- PyTorch version: 2.2.1+cu121
- Huggingface_hub version: 0.22.2
- Transformers version: 4.39.3
Who can help?
@DN6 @sayakpaul @yiyixuxu
Thanks for the detailed thread. Might be a good candidate for discussions.
Have you tried warming up the GPU before reporting the numbers? If not, I would suggest running a few warm-up steps first.
I also tried num_inference_steps=50; even in the last loop iteration, the time did not change a lot (it was even a little slower). So I didn't try warming up.
In addition to what Sayak said, to my knowledge PyTorch executes GPU operations asynchronously. So, could you put torch.cuda.synchronize() before each time.time()? Also, time.perf_counter() is recommended most of the time instead of time.time().
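The async-dispatch pitfall can be illustrated without a GPU. In this sketch, launch_kernel and synchronize are illustrative stand-ins (not PyTorch APIs) for a CUDA kernel launch and torch.cuda.synchronize(): launches return immediately while the work runs in the background, so a naive timer only measures launch overhead, and the real cost gets attributed to whichever later call happens to block on the queue.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulate an asynchronous device: "kernel launches" return immediately and
# the real work runs in the background, like CUDA's default stream.
executor = ThreadPoolExecutor(max_workers=1)
pending = []

def launch_kernel(seconds):
    # Returns immediately; the work itself happens later.
    pending.append(executor.submit(time.sleep, seconds))

def synchronize():
    # Block until all queued work has finished (torch.cuda.synchronize() analogue).
    for f in pending:
        f.result()
    pending.clear()

# Naive timing: measures only the launch overhead, not the work itself.
t0 = time.perf_counter()
launch_kernel(0.2)
naive = time.perf_counter() - t0

synchronize()  # drain the queue before the next measurement

# Correct timing: synchronize before reading the clock.
t0 = time.perf_counter()
launch_kernel(0.2)
synchronize()
synced = time.perf_counter() - t0

print(f"naive: {naive:.4f}s, synchronized: {synced:.4f}s")
```

This mirrors the measurements above: the "cfg" segment looked expensive only because it was the point where the CPU ended up waiting on the UNet's queued kernels.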
Thanks!
I added torch.cuda.synchronize() after the UNet, and the timings look right now!
Also, I'm curious about what happens during torch.cuda.synchronize().
And things get strange when I remove if self.do_classifier_free_guidance:. Will that cause any problem?
Thanks again!
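On the last question: removing the guard is only safe while guidance is actually enabled, because the guidance branch assumes the batch was doubled; with guidance disabled, chunk(2) would split the real batch in half. A minimal sketch with plain lists standing in for the latent tensors (run, do_cfg, and the identity "UNet" are illustrative, not pipeline code):

```python
# Plain-Python sketch of the guidance step; lists stand in for latent tensors.
latents = [1.0, 2.0]      # a batch of 2 "latents"
guidance_scale = 3.0

def run(do_cfg):
    # The pipeline doubles the batch only when doing classifier-free guidance.
    model_input = latents * 2 if do_cfg else latents
    noise_pred = list(model_input)   # pretend the UNet is the identity
    # Guidance applied unconditionally, as in the modified loop above:
    half = len(noise_pred) // 2      # like tensor.chunk(2)
    uncond, cond = noise_pred[:half], noise_pred[half:]
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

print(run(do_cfg=True))   # halves line up; output keeps batch size 2
print(run(do_cfg=False))  # real batch gets split in half: wrong batch size
```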
You might also want to try generating a flame graph on a Linux system.