
Rethinking the `encode_prompt()` method in pipelines

Open sayakpaul opened this issue 1 year ago • 1 comments

This thread is for discussing the possibility of making the most widely used encode_prompt() methods of our pipelines classmethods.

For historical context, I have made such attempts in the past, but for various reasons we decided not to move forward. We are revisiting that now.

I have given this a good amount of thought, and I would like to use this issue to detail the approaches that have come to mind.

Approach 1 -- making encode_prompt() a classmethod

If we do this, the API would look something like this for the SDXL encode_prompt() and similar pipelines:

from typing import Optional

import torch
from transformers import (
    CLIPTextModel,
    CLIPTextModelWithProjection,
    CLIPTokenizer,
)

@classmethod
def encode_prompt_class_method(
    cls,
    prompt: str,
    prompt_2: Optional[str] = None,
    device: Optional[torch.device] = None,
    num_images_per_prompt: int = 1,
    do_classifier_free_guidance: bool = True,
    negative_prompt: Optional[str] = None,
    negative_prompt_2: Optional[str] = None,
    prompt_embeds: Optional[torch.Tensor] = None,
    negative_prompt_embeds: Optional[torch.Tensor] = None,
    pooled_prompt_embeds: Optional[torch.Tensor] = None,
    negative_pooled_prompt_embeds: Optional[torch.Tensor] = None,
    lora_scale: Optional[float] = None,
    clip_skip: Optional[int] = None,
    text_encoder: Optional[CLIPTextModel] = None,
    text_encoder_2: Optional[CLIPTextModelWithProjection] = None,
    tokenizer: Optional[CLIPTokenizer] = None,
    tokenizer_2: Optional[CLIPTokenizer] = None,
):
    ...

(We may have to add additional arguments, but for demonstration purposes this should be sufficient.)

The actual encode_prompt() would then call it like so:

self.encode_prompt_class_method(..., text_encoder=self.text_encoder, ...)

Problems

The user needs to know which text encoders and tokenizers to initialize in order to use encode_prompt_class_method(). This makes the developer experience a bit more convoluted than approach 2.
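To make the pattern concrete, here is a minimal, self-contained sketch of the idea using toy stand-ins (ToyPipeline, a lambda tokenizer and encoder are all hypothetical and not part of diffusers): the classmethod takes the components explicitly, and the instance method simply forwards its own components to it.

```python
class ToyPipeline:
    """Toy illustration of the classmethod pattern; not the real diffusers API."""

    def __init__(self, text_encoder, tokenizer):
        self.text_encoder = text_encoder
        self.tokenizer = tokenizer

    @classmethod
    def encode_prompt_class_method(cls, prompt, text_encoder=None, tokenizer=None):
        # Stand-in for the real tokenization + encoding logic.
        tokens = tokenizer(prompt)
        return text_encoder(tokens)

    def encode_prompt(self, prompt):
        # The instance method supplies its own components to the classmethod.
        return self.encode_prompt_class_method(
            prompt, text_encoder=self.text_encoder, tokenizer=self.tokenizer
        )


# A user can call the classmethod without constructing a full pipeline,
# provided they know which components to pass in:
toy_tokenizer = lambda s: s.split()
toy_encoder = lambda tokens: [len(t) for t in tokens]

embeds = ToyPipeline.encode_prompt_class_method(
    "a photo", text_encoder=toy_encoder, tokenizer=toy_tokenizer
)
print(embeds)  # [1, 5]
```

The "problem" above is visible here: the caller must already know that this pipeline's encode step needs exactly a text_encoder and a tokenizer.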

Approach 2 -- making encode_prompt() work with a valid pipeline initialization

We support initializing pipelines with some model-level components set to None. For example:

from transformers import T5EncoderModel
from diffusers import PixArtAlphaPipeline
import torch

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    transformer=None
)
...

Users should then be able to call pipe.encode_prompt(...). I prefer this approach to Approach 1 because:

  • We are not introducing any new classmethod variants here.
  • Developers initialize the pipelines almost exactly as they normally would; they just set the components unnecessary for running encode_prompt() to None.
  • Since our pipeline components can be reused to initialize other pipelines this should not lead to any memory wastage.
  • Users can still pass any fine-tuned text encoder when initializing the pipeline; everything should work as long as compatibility is guaranteed.
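The pattern behind this approach can be sketched with toy classes (ToyPipeline and the lambda components below are hypothetical stand-ins, not diffusers code): the pipeline tolerates None components, and encode_prompt() validates only the components it actually needs.

```python
class ToyPipeline:
    """Toy illustration of approach 2; not the real diffusers pipeline."""

    def __init__(self, transformer=None, vae=None, text_encoder=None, tokenizer=None):
        # Components not needed for a given task may be left as None.
        self.transformer = transformer
        self.vae = vae
        self.text_encoder = text_encoder
        self.tokenizer = tokenizer

    def encode_prompt(self, prompt):
        # Only the text-encoding components are required here; the
        # denoiser (transformer) and VAE are never touched.
        if self.text_encoder is None or self.tokenizer is None:
            raise ValueError("encode_prompt() requires text_encoder and tokenizer.")
        tokens = self.tokenizer(prompt)
        return self.text_encoder(tokens)


# Initialize with the denoiser and VAE left out; prompt encoding still works.
pipe = ToyPipeline(
    transformer=None,
    vae=None,
    text_encoder=lambda tokens: [len(t) for t in tokens],
    tokenizer=lambda s: s.split(),
)
print(pipe.encode_prompt("a photo"))  # [1, 5]
```

Compared with Approach 1, the user never has to learn a new entry point: the same pipeline constructor and the same encode_prompt() call work, just with fewer components loaded.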

@yiyixuxu @DN6 would love to know what you think. After we reach a consensus, I will start the work.

sayakpaul avatar Jun 28 '24 03:06 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 14 '24 15:09 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 09 '24 15:11 github-actions[bot]

Gentle ping: is this still planned, or are we going to keep things as-is and improve how this works in modular diffusers?

a-r-r-o-w avatar Nov 18 '24 20:11 a-r-r-o-w

We can close this for now, as most pipelines provide an encode_prompt() implementation that works with just the text encoders loaded.

sayakpaul avatar Nov 19 '24 01:11 sayakpaul