Auto model & pipeline for image-text-to-image-text models
Feature request
This is a tracker issue for work on interleaved in-and-out image-text generation.
There are now at least 4 open-source models that can do interleaved image-text generation, and many more are expected to be released. Thus, it would now be practical & useful for us to (1) add native support for such models and (2) standardize the logic flow of data through processors and pipelines, as done in https://github.com/huggingface/transformers/issues/31911 and https://github.com/huggingface/transformers/pull/32472
| Model | Github | Notes | PR |
|---|---|---|---|
| Anole | https://github.com/GAIR-NLP/anole | - | https://github.com/huggingface/transformers/pull/32013 |
| Chameleon | https://github.com/facebookresearch/chameleon | - | https://github.com/huggingface/transformers/pull/32013 |
| Llava-NeXT-Interleaved | https://github.com/LLaVA-VL/LLaVA-NeXT | - | - |
| Lumina-mGPT | https://github.com/Alpha-VLLM/Lumina-mGPT | - | - |
| Transfusion | - | Not open-source (yet, perhaps) | - |
| XGen-MM | https://github.com/salesforce/LAVIS/tree/xgen-mm | The paper & the GitHub repo don't actually demonstrate interleaved image-text generation yet, but they did train the model on such datasets & the model architecture is perfectly suited for it | - |
For reference, initial work on Chameleon & Anole can be found in https://github.com/huggingface/transformers/pull/32013
Notes:
- We explicitly exclude models that can only do text-only generation or image-only generation. We also exclude models that can do image-text generation but not in an interleaved manner.
- As I've demonstrated in my repo, explicitly implementing the Finite State Machine (FSM) for switching between text-generation and image-generation modes, as done in Chameleon's repo, is not necessary. Implicitly implementing the FSM with Logits Processors suffices, although more work is needed to find the most efficient implementation.
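To make the implicit-FSM idea above concrete, here is a minimal, dependency-free sketch. The token ids, the image-block length, and the `mask_logits` helper are all hypothetical (not Chameleon's actual vocabulary); in transformers the same logic would live in a `LogitsProcessor` subclass operating on batched score tensors.

```python
# Hedged sketch of an implicit FSM expressed as a logits mask.
# All token ids and the image-block length below are hypothetical.
NEG_INF = float("-inf")

BOI, EOI = 100, 101             # hypothetical begin/end-of-image token ids
IMAGE_TOKENS = range(102, 110)  # hypothetical image-codebook token ids
IMAGE_BLOCK_LEN = 4             # hypothetical number of image tokens per image

def mask_logits(generated, logits):
    """Mask `logits` depending on the current generation mode.

    `generated` is the list of token ids produced so far; the FSM state
    (text vs. image mode, position within the image block) is recovered
    from it instead of being tracked explicitly.
    """
    # Recover state: are we inside an unfinished image block?
    try:
        start = len(generated) - 1 - generated[::-1].index(BOI)
    except ValueError:
        start = None
    in_image = start is not None and EOI not in generated[start:]

    masked = list(logits)
    if in_image:
        n_image_tokens = len(generated) - start - 1
        if n_image_tokens >= IMAGE_BLOCK_LEN:
            # Block complete: force the end-of-image token.
            allowed = {EOI}
        else:
            allowed = set(IMAGE_TOKENS)
        for tok in range(len(masked)):
            if tok not in allowed:
                masked[tok] = NEG_INF
    else:
        # Text mode: forbid raw image-codebook tokens and a stray EOI.
        for tok in list(IMAGE_TOKENS) + [EOI]:
            masked[tok] = NEG_INF
    return masked
```

Because the state is reconstructed from the generated ids on every step, no explicit mode variable has to be threaded through `generate()`; that is the main difference from the explicit FSM in Chameleon's repo, at the cost of the rescan that the benchmarking TODO below would quantify.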
TODOs:
- [ ] Add support for interleaved image-text generation with:
- [x] Chameleon -> https://github.com/huggingface/transformers/pull/32013
- [x] Anole -> https://github.com/huggingface/transformers/pull/32013
- [ ] Lumina-mGPT
- [ ] Transfusion
- [ ] XGen-MM
- [ ] Add auto model for image-text-to-image-text
- [ ] [Optional] Add auto model for image-to-image-text
- [ ] [Optional] Add auto model for text-to-image-text
- [ ] Add pipeline for image-text-to-image-text
- [ ] [Optional] Add pipeline for image-to-image-text
- [ ] [Optional] Add pipeline for text-to-image-text
- [ ] Benchmark different implementations of Logits Processors & FSMs for switching between text-generation and image-generation modes
Motivation
- To make benchmarking and evaluating models on interleaved image-text generation tasks saner
- To continue work on Multimodal In-and-Out, Interleaved Structured Generation: https://github.com/leloykun/mmsg
Your contribution
I've already started work on Chameleon & Anole here: https://github.com/huggingface/transformers/pull/32013
But I'm currently blocked by (1) not having enough time due to other responsibilities and (2) not having enough compute resources.
Any help would be appreciated!
FYI @NielsRogge and @merveenoyan, you've recently been discussing tags for these kinds of models on the Hub
@leloykun I saw your comment on issue #33905 (Implement LlamaGen for Image Generation).
I want to work on these issues; could you tell me where to begin? I am reading #31911, as you mentioned above.
@GargDivanshu You might also want to take a look at #32013
You can start by adding some of the missing tests and such to gain familiarity with the code there. And once you're ready, I can help you implement multimodal in-and-out for the other models.
perfect, moving to #32013
@zucchini-nlp I think this falls under any-to-any on the Hub side, but I'm not sure transformers should have a separate pipeline: we don't have many of these models as of now, and given the shift to any-to-any we would have to add yet another pipeline for models that can take audio input or output on top of the modalities here. @NielsRogge
Yes, I agree it should be any-to-any. I was just keeping you in the loop since some contributors are working on adding these types of models :)