Added ability to provide list of enabled embeddings for maybe_convert_prompt call
What does this PR do?
This PR adds the ability to provide a list of enabled embeddings for prompt conversion with the TextualInversionLoaderMixin.
Right now, in order to disable an embedding, we need to call unload_textual_inversion, which modifies both the tokenizer and the text_encoder. However, if we want to enable it again, we have to call load_textual_inversion once more, which is a bit cumbersome.
This implementation doesn't change the default behavior (if None is provided, all embeddings are enabled), but to make it work, we need to add the corresponding parameter to pipelines all over the library (~140 places). I am not sure how this could be done easily, so any suggestions are welcome.
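For illustration, here is a minimal, self-contained sketch of the proposed behavior. The function and the `enabled_embeddings` parameter below are hypothetical stand-ins for the diffusers `maybe_convert_prompt` logic, not the actual implementation:

```python
def maybe_convert_prompt(prompt, loaded_tokens, enabled_embeddings=None):
    """Expand textual-inversion trigger tokens into their multi-vector forms.

    loaded_tokens maps a trigger token to its list of sub-tokens, e.g.
    {"<cat-toy>": ["<cat-toy>", "<cat-toy>_1"]}. If enabled_embeddings is
    None (the default), every loaded embedding is expanded, preserving the
    current behavior; otherwise only the listed embeddings are expanded.
    """
    for token, sub_tokens in loaded_tokens.items():
        if enabled_embeddings is not None and token not in enabled_embeddings:
            continue  # embedding stays loaded but is skipped for this prompt
        prompt = prompt.replace(token, " ".join(sub_tokens))
    return prompt


tokens = {"<cat-toy>": ["<cat-toy>", "<cat-toy>_1"]}
print(maybe_convert_prompt("a photo of <cat-toy>", tokens))
# -> "a photo of <cat-toy> <cat-toy>_1"
print(maybe_convert_prompt("a photo of <cat-toy>", tokens, enabled_embeddings=[]))
# -> "a photo of <cat-toy>" (embedding loaded but disabled for this call)
```

The key point is that disabling an embedding becomes a per-call decision at prompt-conversion time, with no mutation of the tokenizer or text_encoder.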
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline?
- [ ] Did you read our philosophy doc (important for complex PRs)?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@sayakpaul your review and suggestions are kindly requested.
cc @linoytsaban
What is the use case for this? Could you maybe show us the workflow you'd take without this PR (like the one you mentioned in the PR description)?
Even if it's a little cumbersome, if it can be done easily and cleanly, I think users would still prefer that, since it makes things explicit. I'd be in favor of that, personally.
@linoytsaban @yiyixuxu what are your thoughts?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
gentle ping @kovalexal
Hi @sayakpaul! Sorry, this probably slipped under my radar.
My main point is that it would be great to be able to enable/disable embeddings without any additional need to modify the models (that is, without touching the model's internal tensors or dealing with tensor allocation/deallocation, etc.).
The main use case is that my colleagues and I have built an internal library on top of diffusers at the company where I work (probably an approach most companies end up taking anyway). We started with the same philosophy that diffusers has for its pipelines, a single pipeline per service, but it soon became a disaster to support and maintain with a small group of engineers (some pipelines were txt2img/img2img/inpaint, some were LPW pipelines, some used ControlNets/IP-Adapters/etc., some didn't). Eventually we arrived at a solution with a single pipeline that supports the required set of technologies and where every parameter is optional (optional ControlNet/IP-Adapter hints, optional embeddings, optional LoRAs, etc.), so if the pipeline is instantiated with some base checkpoint + IP-Adapter, we need to be able to produce the same output as the plain base checkpoint when no IP-Adapter hint is provided.
This approach works pretty well in our case, but unfortunately we have reached a point where there are some architectural differences from the basic diffusers pipelines (mostly because of the strict requirement that hints/LoRAs/embeddings/etc. be optional). Still, we are trying to minimize the differences from your pipelines so that we can keep using the latest advances you integrate into diffusers.
Now we are moving from more monolithic services to a microservice architecture (to reduce infrastructure costs), where there will be multiple microservices running diffusion pipelines (grouped by shared base checkpoints), so being able to work with optional hints and optional embeddings becomes critical.
So, to me it seems a good solution to introduce a new parameter in this method (more of a KISS approach, IMHO) rather than either creating a new handle to load/unload textual inversions (which would have to be called from somewhere) or following a load + process + unload pattern (which would have to run on every incoming request and would also require caching the embeddings somewhere, when there is no real need to). I know this might be a rather rare situation (and of course we could choose a unique name for our embedding and never unload it), but it would still avoid problems when a user loads an embedding with a common token. Also, for this to work properly, the token processing in diffusers needs to change a bit: instead of the replacement RARE_TOKEN -> RARE_TOKEN RARE_TOKEN_1 RARE_TOKEN_2, it would be better to do the replacement RARE_TOKEN -> RARE_TOKEN_0 RARE_TOKEN_1 RARE_TOKEN_2 (this is how we do it internally).
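To make the two replacement schemes concrete, here is a small sketch (token names and helper functions are hypothetical, not the diffusers implementation):

```python
def expand_current(token, n_vectors):
    # Current-style scheme: the bare trigger token stays in the expanded
    # prompt, followed by numbered sub-tokens starting at _1. The bare token
    # is therefore ambiguous: it is both the user-facing trigger and a
    # placeholder embedding in the converted prompt.
    return " ".join([token] + [f"{token}_{i}" for i in range(1, n_vectors)])


def expand_proposed(token, n_vectors):
    # Proposed scheme: every sub-token is numbered, including the first one,
    # so the bare trigger token never appears in a converted prompt and can
    # be enabled/disabled or remapped unambiguously.
    return " ".join(f"{token}_{i}" for i in range(n_vectors))


print(expand_current("RARE_TOKEN", 3))   # RARE_TOKEN RARE_TOKEN_1 RARE_TOKEN_2
print(expand_proposed("RARE_TOKEN", 3))  # RARE_TOKEN_0 RARE_TOKEN_1 RARE_TOKEN_2
```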
Sorry for the long text, it would have been hard to make my point without some context.
@yiyixuxu Wdyt?
@kovalexal
Can you explain this? So the implementation in this PR is not sufficient on its own?
> This implementation doesn't change the default behavior (if None is provided, all embeddings are enabled), but to make it work, we need to add the corresponding parameter to pipelines all over the library (~140 places). I am not sure how this could be done easily, so any suggestions are welcome.
@yiyixuxu Hi!
Yes, you are correct. I just wanted to make sure that this change does not violate your view of this part of the library, and if it doesn't, I can proceed with the implementation. WDYT?