CLIP model support for pipelines
I am not sure if this has been considered, or is already on the roadmap, but I'd love to be able to just throw a CLIP model ID at a pipe and have it download and use said model.
I see HF has the capability to do this, and I have seen the CLIP guided diffusion example Colab, but it seems to be implemented only for text2img and not for all the pipelines.
I tried searching PRs, issues, and the repo for anything related to this being planned, but I don't see anything.
On a side note, it would be cool to see pipes merged, with the necessary mode selected via an enum or something like pipe.IMG2IMG as the first param for a pretrained pipe setup.
PS: loving these latest commits! Wow! Faster, and more flexible.
Hi, thanks for the issue!
It may not be possible to support CLIP guidance in all pipelines, as we want to keep the pipelines simple so any user can modify them according to their needs. Pipelines are meant to be examples of how a certain task can be done, so they may not support all functionality. We encourage users to take the pipelines and modify them as needed.
Also, we just released community pipelines, which allow any community pipeline to be loaded from diffusers. You can see how to load the CLIP guided pipeline easily in the docs:
from diffusers import DiffusionPipeline
from transformers import CLIPFeatureExtractor, CLIPModel

clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"

feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_model_id)
clip_model = CLIPModel.from_pretrained(clip_model_id)

pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="clip_guided_stable_diffusion",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
)
With this, anyone can share their pipelines with the community very easily. Can't wait to see a community contribution for the merged pipelines; we can add it to the community pipelines here: https://github.com/huggingface/diffusers/tree/main/examples/community. Feel free to take a shot at it if you want, happy to help :)
Thanks for the example. Is it safe to assume you can override the feature extractor and CLIP model of any pipe without having to reimplement your whole system around custom, and possibly unstable, pipes?
Imo this isn't intuitive for most of the community and leaves access to HF CLIP models to the code-affluent. That sort of limited, arbitrary access on the basis of experience is sad.
Hey @WASasquatch,
It'd be really nice if you could propose how the code could look instead so that we can try to go from "sad" to "happy" :slightly_smiling_face:
I've been thinking for a long time that pipes [of a type] should be merged.
Stable Diffusion, for example, should have all its basic pipelines as one Stable Diffusion pipeline:
StableDiffusionPipeline (similar to how it is now for text2img), except from this method one can use an ID or enum to define the type of pipe it is.
This means the new StableDiffusionPipeline is able to dynamically accept optional init images or masks on the basis of the mode it's in.
Now StableDiffusionPipeline is like a metropolitan office building. It has levels (modes) you can quickly access. It's no longer like a rural office park complex where you must take a golf cart and drive over to the next building for that department. :)
It's an efficient structure for high input/output departments that need to work together (such as text2img -> img2img/inpaint workflows).
You wouldn't need to make separate pipes, or overwrite pipes, but could dynamically change the mode it's fed, with one pipe dynamically accepting and doing the work of all three methods.
I feel the benefits here are:
- File structure bloat is reduced with pipes of a type merged
- Community extensions of said pipes will be all-encompassing to current diffusers pipe offerings, without reimplementing the same code across multiple pipes to maintain
- Access on the user's API side is improved, and script bloat on pipe setups is reduced
- Code can be more efficient and effective using a dynamic approach to which pipe is used via ID/enum
- Happy HF Community ☺️
Cons:
- Users would no doubt need to adjust their current implementations (but in diffusers' infancy, this has happened frequently already as it matures)
- ???
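A minimal sketch of what the enum-dispatched merged pipeline could look like. All names here are hypothetical, and the private methods are stubs standing in for the real denoising loops:

```python
from enum import Enum, auto

class Mode(Enum):
    TEXT2IMG = auto()
    IMG2IMG = auto()
    INPAINT = auto()

class MergedPipeline:
    """Hypothetical merged pipeline: one entry point, mode picked by enum."""

    def __call__(self, prompt, mode=Mode.TEXT2IMG, init_image=None, mask=None):
        if mode is Mode.TEXT2IMG:
            return self._text2img(prompt)
        if mode is Mode.IMG2IMG:
            return self._img2img(prompt, init_image)
        return self._inpaint(prompt, init_image, mask)

    # Stubs: a real implementation would run the appropriate diffusion loop.
    def _text2img(self, prompt):
        return f"text2img({prompt})"

    def _img2img(self, prompt, init_image):
        return f"img2img({prompt}, {init_image})"

    def _inpaint(self, prompt, init_image, mask):
        return f"inpaint({prompt}, {init_image}, {mask})"

pipe = MergedPipeline()
result = pipe("a cat", mode=Mode.IMG2IMG, init_image="img.png")
```

One pipe object handles all three modes, so workflows like text2img -> img2img never need to swap pipelines.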
To add further, it's possible that the StableDiffusionPipeline may not even need an ID/enum. It would be logical enough to assume img2img if an INIT is provided and no MASK is provided, and it's also safe to assume inpaint if both an INIT and a MASK are provided. This is how I do my detection as is. If both are None, then just use text2img.
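The implicit detection described above could be sketched as a small helper (a hypothetical function, not part of diffusers):

```python
def detect_mode(init_image=None, mask=None):
    """Infer the pipeline mode from which optional inputs are present."""
    if init_image is not None and mask is not None:
        return "inpaint"   # init + mask -> inpainting
    if init_image is not None:
        return "img2img"   # init only -> image-to-image
    return "text2img"      # neither -> plain text-to-image
```

The merged pipeline's `__call__` could run this first and dispatch accordingly, so callers never pass a mode at all.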
In my case, I also use joblib to cache my pipes: loading a cached pipe takes just 2 seconds vs. around 24 seconds switching pipes via diffusers (assuming no downloads are needed). Diffusers could incorporate its own pipe caching to speed things up. Joblib supports caching the entire pipe object without issues.
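The caching pattern described here can be sketched with the standard library's pickle (joblib's `dump`/`load` follow the same shape and handle large tensors more efficiently). The cache directory, helper name, and the toy "pipeline" are illustrative stand-ins, not real diffusers API:

```python
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # stand-in for a persistent cache directory

def load_cached(name, build_fn):
    """Return a cached object if present, otherwise build it and cache it.

    With joblib, the two file operations become joblib.load(path) and
    joblib.dump(obj, path).
    """
    path = os.path.join(CACHE_DIR, f"{name}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # fast path: skip the expensive build
    obj = build_fn()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

# Toy stand-in for an expensive pipeline constructor.
pipe = load_cached("sd-pipe", lambda: {"model": "stable-diffusion-v1-4"})
# Second call hits the cache; the builder is never invoked.
pipe_again = load_cached("sd-pipe", lambda: {"model": "never built"})
```

The same shape would let diffusers return a deserialized pipe in seconds instead of reconstructing it from its component checkpoints on every switch.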
I love that idea
We have a long issue regarding "merging" stable diffusion pipeline into one, could you please comment there: https://github.com/huggingface/diffusers/issues/551
It would be nice if this issue could focus on the topic of "CLIP model support for pipelines"
Right, sounds good, though I feel this is directly related: the fact that this isn't already a feature is why the CLIP Guided Diffusion pipeline inherits from a basic text2img pipeline. It's the root cause of the issue that led us here, because it requires anyone donating time to community pipelines to reimplement the same thing across pipes to cover all of Diffusers' features.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.