
Remove redundant comparison inside the diffusion loop of stable video diffusion pipeline

Open dianyo opened this issue 1 year ago • 7 comments

Is your feature request related to a problem? Please describe. I found that the __call__ of the stable video diffusion pipeline keeps doing async memcpys between host and device inside the denoising loop, as shown in the attached profiler screenshot (Screenshot 2024-09-12 at 6 45 24 PM).

Describe the solution you'd like. The reason is that every time we access self.do_classifier_free_guidance, we compare a tensor against an int, get a boolean on the device, and then memcpy that boolean from GPU to CPU.

It would be good to just assign the value to a variable before the loop, since it won't change during the loop.
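To illustrate the per-step sync (a minimal, hypothetical sketch assuming a CUDA device; the shapes and values are made up, not taken from the pipeline):

import torch

# hypothetical per-frame guidance scales, similar to the linspace the SVD pipeline builds
guidance_scale = torch.linspace(1.0, 3.0, 25, device="cuda")

for step in range(25):  # stand-in for the denoising loop
    # the comparison runs on the GPU, and the truth-value check then copies the
    # resulting bool to the CPU -- one host sync per step
    if guidance_scale.max() > 1:
        pass  # concat latents / apply guidance here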

Additional context. I'd be glad to contribute this by opening a PR.

dianyo avatar Sep 12 '24 10:09 dianyo

cc: @a-r-r-o-w

asomoza avatar Sep 12 '24 14:09 asomoza

Hey! Could you specify which lines in the code you are referring to here? Is it these lines?

a-r-r-o-w avatar Sep 12 '24 14:09 a-r-r-o-w

Hi @a-r-r-o-w, there are two places where self.do_classifier_free_guidance is called inside the diffusion loop:

  • Concatenating the latents if we want to do classifier-free guidance (here)
  • Performing guidance (here)

Both of them have a potential memory copy between devices, as mentioned, because of the following property:

@property
def do_classifier_free_guidance(self):
    if isinstance(self.guidance_scale, (int, float)):
        return self.guidance_scale > 1
    # guidance_scale is a tensor here, so the comparison yields a bool tensor on its device
    return self.guidance_scale.max() > 1

I would like to solve this in a very trivial way, i.e. defining a new variable do_classifier_free_guidance = self.do_classifier_free_guidance.item() before the loop, so the .item() conversion from torch.bool to a Python bool happens only once.
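Roughly like this (a sketch of the shape of the change, not the actual pipeline code; the unet forward pass is elided and the guidance application is simplified):

# hoist the flag once before the denoising loop; .item() syncs a single time
# (this assumes the property returned a tensor, as in SVD where guidance_scale is a tensor)
do_classifier_free_guidance = self.do_classifier_free_guidance.item()

for i, t in enumerate(timesteps):
    latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
    # ... unet forward pass producing noise_pred ...
    if do_classifier_free_guidance:
        noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_cond - noise_pred_uncond)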

dianyo avatar Sep 12 '24 17:09 dianyo

Oh, I see what you mean now. The reason we do this is to let users dynamically adjust the guidance scale when a callback is passed, by updating the value of self._guidance_scale. So, for example, one could use CFG for the first few steps and disable it for the last few by setting it <= 1. Whenever it is updated, we check whether CFG should be enabled for a specific inference step using the self.do_classifier_free_guidance property. Dynamic guidance is used quite a lot and is sometimes even integrated directly into the pipeline (as in CogVideoX).
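For reference, the callback pattern looks roughly like this (a hedged sketch assuming the pipeline exposes callback_on_step_end and num_timesteps like most diffusers pipelines; the step threshold and value are arbitrary):

def disable_cfg_after_40_percent(pipe, step_index, timestep, callback_kwargs):
    # setting _guidance_scale <= 1 makes do_classifier_free_guidance return False from here on
    if step_index == int(pipe.num_timesteps * 0.4):
        pipe._guidance_scale = 1.0
    return callback_kwargs

frames = pipe(image, callback_on_step_end=disable_cfg_after_40_percent).frames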

I understand that it comes with an unnecessary memcpy in this case (I think only num_frames float values are copied and converted to bool?), but I'm afraid there's not much that can be done as we expect to provide this interface in all our pipelines (cc @yiyixuxu).

One solution could be to modify the pipeline to suit your use case, since our pipelines serve as minimal inference examples and are not specifically geared towards training or production readiness. Another solution would be to use a constant int/float guidance value, either by updating it in the callback (the first iteration will, however, still use the linspace guidance values) or by removing the relevant linspace bit (similar to modifying the pipeline to suit your use case).

a-r-r-o-w avatar Sep 12 '24 19:09 a-r-r-o-w

Thanks for your reply @a-r-r-o-w! Would it be possible to prepare all the necessary data before the loop? I'm not sure how many pipelines people usually call with a dynamic guidance scale, but I assumed that the guidance scale only depends on t, so it could be computed before the loop starts. I can see it would take a lot of engineering effort to bring this into the whole codebase, but since the diffusion loop accounts for a high percentage of common workloads, I think it's worth implementing for efficiency.
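Something along these lines (a hypothetical sketch; guidance_scale_for_step and the other names are made up, not existing diffusers code):

# precompute everything that only depends on the timestep before the loop starts;
# any GPU -> CPU syncs now happen up front instead of once per denoising step
guidance_scale_schedule = [guidance_scale_for_step(t) for t in timesteps]  # hypothetical helper
do_cfg_schedule = [bool(scale.max() > 1) for scale in guidance_scale_schedule]

for i, t in enumerate(timesteps):
    if do_cfg_schedule[i]:
        pass  # concat latents / apply guidance with guidance_scale_schedule[i]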

I'm not actually looking for an individual solution tbh, I just wanna try to make diffusers more efficient =P

dianyo avatar Sep 13 '24 09:09 dianyo

I'm not actually looking for an individual solution tbh, I just wanna try to make diffusers more efficient =P

Thank you, that's so awesome to hear!

I think the first major area of improvement (one that can impact all pipelines and bring considerable speedups) would be #9475 and removing all cudaStreamSyncs in the pipeline. If you have any insights regarding that, or similar improvements, feel free to chime in on the discussion there. We can work together on profiling and reducing cudaMemCpys after that, WDYT?

a-r-r-o-w avatar Sep 23 '24 21:09 a-r-r-o-w

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 18 '24 15:10 github-actions[bot]
