A strange time cost in denoising loop
Describe the bug
I found a strange time cost when running inference with StableVideoDiffusionPipeline on GPU. On a 3090, the UNet only costs 1.6s, yet a single denoising loop iteration costs 3.3s.
This is strange, so I analysed the time cost of all parts of the denoising loop with Viztracer (a tool for tracing and visualizing execution time), and found a strange blank in the denoising loop.
The green forward belongs to the UNet, the pink step belongs to the scheduler, and __call__ belongs to the pipeline.
Timing each part, I get: before unet: 0.0006058, unet time: 1.5902, cfg time: 1.6592, step time: 0.0005975, loop time: 3.2510 for every loop iteration. I do not think classifier-free guidance should cost that much! If I only remove the if self.do_classifier_free_guidance: guard, I get before unet: 0.0005329, unet time: 1.6008, cfg time: 0.001716, step time: 0.02834 for the first iteration, and before unet: 1.6479, unet time: 1.6055, cfg time: 0.001713, step time: 0.02958 for the other iterations.
If I run inference on CPU, the numbers look right.
The brown forward belongs to the UNet; __call__ belongs to the pipeline.
before unet time: 0.002647, unet time: 394.4368, cfg time: 0.0008144, step time: 0.008308
I also tested DiTPipeline; the behavior is the same as with StableVideoDiffusionPipeline.
The green forward belongs to the DiT; __call__ belongs to the pipeline.
I'm really curious about what happens during that time. Thanks a lot for the help!
Reproduction
I was simply using StableVideoDiffusionPipeline.
import torch
from diffusers import StableVideoDiffusionPipeline

pipeline = StableVideoDiffusionPipeline.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16, variant="fp16"
)
pipeline.to("cuda")

frames = pipeline(
    image,
    decode_chunk_size=5,
    num_frames=25,
    num_inference_steps=5,  # denoising steps
    min_guidance_scale=1.0,
    max_guidance_scale=3.0,
    fps=7,
).frames[0]
Tracing time usage
with self.progress_bar(total=num_inference_steps) as progress_bar:
    for i, t in enumerate(timesteps):
        s_t = time.time()
        # expand the latents if we are doing classifier free guidance
        latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

        # Concatenate image_latents over channels dimension
        latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)
        prepare_t = time.time()

        # predict the noise residual
        noise_pred = self.unet(
            latent_model_input,
            t,
            encoder_hidden_states=image_embeddings,
            added_time_ids=added_time_ids,
            return_dict=False,
        )[0]
        print(noise_pred.shape)
        unet_t = time.time()

        # perform guidance
        if self.do_classifier_free_guidance:
            noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_cond - noise_pred_uncond)
        cfg_t = time.time()

        # compute the previous noisy sample x_t -> x_t-1
        latents = self.scheduler.step(noise_pred, t, latents).prev_sample
        step_t = time.time()

        if callback_on_step_end is not None:
            callback_kwargs = {}
            for k in callback_on_step_end_tensor_inputs:
                callback_kwargs[k] = locals()[k]
            callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
            latents = callback_outputs.pop("latents", latents)

        if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
            progress_bar.update()

        print(f'prepare time: {prepare_t - s_t}, unet time: {unet_t - prepare_t}, cfg time: {cfg_t - unet_t}, step time: {step_t - cfg_t}')
        end_time = time.time()
        print('loop time: ', end_time - s_t)
Removing if self.do_classifier_free_guidance:
with self.progress_bar(total=num_inference_steps) as progress_bar:
    for i, t in enumerate(timesteps):
        s_t = time.time()
        # expand the latents if we are doing classifier free guidance
        latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

        # Concatenate image_latents over channels dimension
        latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)
        prepare_t = time.time()

        # predict the noise residual
        noise_pred = self.unet(
            latent_model_input,
            t,
            encoder_hidden_states=image_embeddings,
            added_time_ids=added_time_ids,
            return_dict=False,
        )[0]
        print(noise_pred.shape)
        unet_t = time.time()

        # perform guidance
        # if self.do_classifier_free_guidance:
        noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_cond - noise_pred_uncond)
        cfg_t = time.time()

        # compute the previous noisy sample x_t -> x_t-1
        latents = self.scheduler.step(noise_pred, t, latents).prev_sample
        step_t = time.time()

        if callback_on_step_end is not None:
            callback_kwargs = {}
            for k in callback_on_step_end_tensor_inputs:
                callback_kwargs[k] = locals()[k]
            callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
            latents = callback_outputs.pop("latents", latents)

        if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
            progress_bar.update()

        print(f'prepare time: {prepare_t - s_t}, unet time: {unet_t - prepare_t}, cfg time: {cfg_t - unet_t}, step time: {step_t - cfg_t}')
        end_time = time.time()
        print('loop time: ', end_time - s_t)
Logs
No response
System Info
- Platform: Linux worker6 6.5.0-21-generic # 21~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 9 13:32:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
- GPU: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
- CPU: 20 Intel(R) Xeon(R) W-2150B CPU @ 3.00GHz
- diffusers version: 0.27.0
- Python version: 3.10.13
- PyTorch version: 2.2.1+cu121
- Huggingface_hub version: 0.22.2
- Transformers version: 4.39.3
Who can help?
@DN6 @sayakpaul @yiyixuxu
Thanks for the detailed thread. Might be a good candidate for discussions.
Have you tried warming up the GPU before reporting the numbers? If not, I would suggest running a few warm-up steps first.
I also tried num_inference_steps=50; even in the last loop iteration, the time did not change a lot (it was even a little slower). So I didn't try warming up.
In addition to what Sayak said, to my knowledge PyTorch executes GPU operations asynchronously. So, could you put torch.cuda.synchronize() before each time.time()? Also, time.perf_counter() is recommended most of the time instead of time.time().
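The async-dispatch pitfall can be illustrated without a GPU. In this sketch, launch_kernel and synchronize are illustrative stand-ins (not PyTorch APIs) for a CUDA kernel launch and torch.cuda.synchronize(): launches return immediately while the work runs in the background, so a naive timer only measures launch overhead, and the real cost gets attributed to whichever later call happens to block on the queue.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulate an asynchronous device: "kernel launches" return immediately and
# the real work runs in the background, like CUDA's default stream.
executor = ThreadPoolExecutor(max_workers=1)
pending = []

def launch_kernel(seconds):
    # Returns immediately; the work itself happens later.
    pending.append(executor.submit(time.sleep, seconds))

def synchronize():
    # Block until all queued work has finished (torch.cuda.synchronize() analogue).
    for f in pending:
        f.result()
    pending.clear()

# Naive timing: measures only the launch overhead, not the work itself.
t0 = time.perf_counter()
launch_kernel(0.2)
naive = time.perf_counter() - t0

synchronize()  # drain the queue before the next measurement

# Correct timing: synchronize before reading the clock.
t0 = time.perf_counter()
launch_kernel(0.2)
synchronize()
synced = time.perf_counter() - t0

print(f"naive: {naive:.4f}s, synchronized: {synced:.4f}s")
```

This mirrors the measurements above: the "cfg" segment looked expensive only because it was the point where the CPU ended up waiting on the UNet's queued kernels.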
Thanks!
I added torch.cuda.synchronize() after the UNet, and the timings look right now!
Also, I'm curious about what happens during torch.cuda.synchronize().
And things get strange when I remove if self.do_classifier_free_guidance:. Will that cause any problem?
Thanks again!
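On the last question: removing the guard is only safe while guidance is actually enabled, because the guidance branch assumes the batch was doubled; with guidance disabled, chunk(2) would split the real batch in half. A minimal sketch with plain lists standing in for the latent tensors (run, do_cfg, and the identity "UNet" are illustrative, not pipeline code):

```python
# Plain-Python sketch of the guidance step; lists stand in for latent tensors.
latents = [1.0, 2.0]      # a batch of 2 "latents"
guidance_scale = 3.0

def run(do_cfg):
    # The pipeline doubles the batch only when doing classifier-free guidance.
    model_input = latents * 2 if do_cfg else latents
    noise_pred = list(model_input)   # pretend the UNet is the identity
    # Guidance applied unconditionally, as in the modified loop above:
    half = len(noise_pred) // 2      # like tensor.chunk(2)
    uncond, cond = noise_pred[:half], noise_pred[half:]
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

print(run(do_cfg=True))   # halves line up; output keeps batch size 2
print(run(do_cfg=False))  # real batch gets split in half: wrong batch size
```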
You might also want to try generating a flame graph on a Linux system.