
Rethinking the `encode_prompt()` method in pipelines

Open sayakpaul opened this issue 1 year ago • 1 comments

This thread is for discussing the possibility of making the most widely used encode_prompt() methods of our pipelines classmethods.

For historical context, I have made such attempts in the past, but for various reasons we decided not to move forward. We are revisiting that now.

I have given this a good amount of thought, and I would like to use this issue to detail the approaches that have come to mind.

Approach 1 -- making encode_prompt() a classmethod

If we do this, the API would look something like this for the SDXL encode_prompt() and similar pipelines:

from typing import Optional

import torch
from transformers import (
    CLIPTextModel,
    CLIPTextModelWithProjection,
    CLIPTokenizer,
)

@classmethod
def encode_prompt_class_method(
    cls,
    prompt: str,
    prompt_2: Optional[str] = None,
    device: Optional[torch.device] = None,
    num_images_per_prompt: int = 1,
    do_classifier_free_guidance: bool = True,
    negative_prompt: Optional[str] = None,
    negative_prompt_2: Optional[str] = None,
    prompt_embeds: Optional[torch.Tensor] = None,
    negative_prompt_embeds: Optional[torch.Tensor] = None,
    pooled_prompt_embeds: Optional[torch.Tensor] = None,
    negative_pooled_prompt_embeds: Optional[torch.Tensor] = None,
    lora_scale: Optional[float] = None,
    clip_skip: Optional[int] = None,
    text_encoder: Optional[CLIPTextModel] = None,
    text_encoder_2: Optional[CLIPTextModelWithProjection] = None,
    tokenizer: Optional[CLIPTokenizer] = None,
    tokenizer_2: Optional[CLIPTokenizer] = None,
):
    ...

(We may have to add additional arguments, but for demonstration purposes this should be sufficient.)

The actual encode_prompt() would then call it like so:

self.encode_prompt_class_method(..., text_encoder=self.text_encoder, ...)

Problems

The user needs to know which text encoders and tokenizers to initialize in order to use encode_prompt_class_method(). This makes the developer experience a bit more convoluted than approach 2.
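To make the pattern concrete, here is a minimal, self-contained sketch of the idea using toy stand-ins (ToyPipeline, a lambda tokenizer and encoder are all hypothetical and not part of diffusers): the classmethod takes the components explicitly, and the instance method simply forwards its own components to it.

```python
class ToyPipeline:
    """Toy illustration of the classmethod pattern; not the real diffusers API."""

    def __init__(self, text_encoder, tokenizer):
        self.text_encoder = text_encoder
        self.tokenizer = tokenizer

    @classmethod
    def encode_prompt_class_method(cls, prompt, text_encoder=None, tokenizer=None):
        # Stand-in for the real tokenization + encoding logic.
        tokens = tokenizer(prompt)
        return text_encoder(tokens)

    def encode_prompt(self, prompt):
        # The instance method supplies its own components to the classmethod.
        return self.encode_prompt_class_method(
            prompt, text_encoder=self.text_encoder, tokenizer=self.tokenizer
        )


# A user can call the classmethod without constructing a full pipeline,
# provided they know which components to pass in:
toy_tokenizer = lambda s: s.split()
toy_encoder = lambda tokens: [len(t) for t in tokens]

embeds = ToyPipeline.encode_prompt_class_method(
    "a photo", text_encoder=toy_encoder, tokenizer=toy_tokenizer
)
print(embeds)  # [1, 5]
```

The "problem" above is visible here: the caller must already know that this pipeline's encode step needs exactly a text_encoder and a tokenizer.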

Approach 2 -- making encode_prompt() work with a valid pipeline initialization

We support initializing pipelines with some model-level components set to None. For example:

from transformers import T5EncoderModel
from diffusers import PixArtAlphaPipeline
import torch

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",
    transformer=None
)
...

Users should then be able to call pipe.encode_prompt(...). I prefer this approach to Approach 1 because:

  • We are not introducing any new classmethod variants here.
  • Developers initialize the pipelines almost exactly as they normally would; they just set the components unnecessary for running encode_prompt() to None.
  • Since our pipeline components can be reused to initialize other pipelines this should not lead to any memory wastage.
  • Users can still pass any fine-tuned text encoder when initializing the pipeline; everything should work as long as compatibility is guaranteed.
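The pattern behind this approach can be sketched with toy classes (ToyPipeline and the lambda components below are hypothetical stand-ins, not diffusers code): the pipeline tolerates None components, and encode_prompt() validates only the components it actually needs.

```python
class ToyPipeline:
    """Toy illustration of approach 2; not the real diffusers pipeline."""

    def __init__(self, transformer=None, vae=None, text_encoder=None, tokenizer=None):
        # Components not needed for a given task may be left as None.
        self.transformer = transformer
        self.vae = vae
        self.text_encoder = text_encoder
        self.tokenizer = tokenizer

    def encode_prompt(self, prompt):
        # Only the text-encoding components are required here; the
        # denoiser (transformer) and VAE are never touched.
        if self.text_encoder is None or self.tokenizer is None:
            raise ValueError("encode_prompt() requires text_encoder and tokenizer.")
        tokens = self.tokenizer(prompt)
        return self.text_encoder(tokens)


# Initialize with the denoiser and VAE left out; prompt encoding still works.
pipe = ToyPipeline(
    transformer=None,
    vae=None,
    text_encoder=lambda tokens: [len(t) for t in tokens],
    tokenizer=lambda s: s.split(),
)
print(pipe.encode_prompt("a photo"))  # [1, 5]
```

Compared with Approach 1, the user never has to learn a new entry point: the same pipeline constructor and the same encode_prompt() call work, just with fewer components loaded.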

@yiyixuxu @DN6 would love to know what you think. After we reach a consensus, I will start the work.

sayakpaul avatar Jun 28 '24 03:06 sayakpaul

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 14 '24 15:09 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 09 '24 15:11 github-actions[bot]

Gentle ping: is this still planned, or are we going to keep things as-is and improve how this works in modular diffusers?

a-r-r-o-w avatar Nov 18 '24 20:11 a-r-r-o-w

We can close this for now, as most pipelines provide an encode_prompt() implementation that works with just the text encoders loaded.

sayakpaul avatar Nov 19 '24 01:11 sayakpaul