Auto model & pipeline for image-text-to-image-text models
Feature request
This is a tracker issue for work on interleaved in-and-out image-text generation.
There are now at least 4 open-source models that can do interleaved image-text generation, and many more are expected to be released. Thus, it would now be practical & useful for us to (1) add native support for such models and (2) standardize the logic flow of data through processors and pipelines, as done in https://github.com/huggingface/transformers/issues/31911 and https://github.com/huggingface/transformers/pull/32472
| Model | Github | Notes | PR |
|---|---|---|---|
| Anole | https://github.com/GAIR-NLP/anole | - | https://github.com/huggingface/transformers/pull/32013 |
| Chameleon | https://github.com/facebookresearch/chameleon | - | https://github.com/huggingface/transformers/pull/32013 |
| Llava-NeXT-Interleaved | https://github.com/LLaVA-VL/LLaVA-NeXT | - | - |
| Lumina-mGPT | https://github.com/Alpha-VLLM/Lumina-mGPT | - | - |
| Transfusion | - | Not open-source (yet, perhaps) | - |
| XGen-MM | https://github.com/salesforce/LAVIS/tree/xgen-mm | The paper & the GitHub repo don't actually demonstrate interleaved image-text generation yet, but they did train the model on such datasets & the model architecture is perfectly suited for it | - |
For reference, initial work on Chameleon & Anole can be found in https://github.com/huggingface/transformers/pull/32013
Notes:
- We explicitly exclude models that can only do text-only generation or image-only generation. We also exclude models that can do image-text generation but not in an interleaved manner.
- As I've demonstrated in my repo, explicitly implementing the Finite State Machine (FSM) for switching between text-generation and image-generation modes, as done in Chameleon's repo, is not necessary. Implicitly implementing the FSM with Logits Processors suffices, although more work is needed to find the most efficient implementation.
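To make the implicit-FSM idea above concrete, here is a minimal, dependency-free sketch. The token ids, the image-block length, and the `mask_logits` helper are all hypothetical (not Chameleon's actual vocabulary); in transformers the same logic would live in a `LogitsProcessor` subclass operating on batched score tensors.

```python
# Hedged sketch of an implicit FSM expressed as a logits mask.
# All token ids and the image-block length below are hypothetical.
NEG_INF = float("-inf")

BOI, EOI = 100, 101             # hypothetical begin/end-of-image token ids
IMAGE_TOKENS = range(102, 110)  # hypothetical image-codebook token ids
IMAGE_BLOCK_LEN = 4             # hypothetical number of image tokens per image

def mask_logits(generated, logits):
    """Mask `logits` depending on the current generation mode.

    `generated` is the list of token ids produced so far; the FSM state
    (text vs. image mode, position within the image block) is recovered
    from it instead of being tracked explicitly.
    """
    # Recover state: are we inside an unfinished image block?
    try:
        start = len(generated) - 1 - generated[::-1].index(BOI)
    except ValueError:
        start = None
    in_image = start is not None and EOI not in generated[start:]

    masked = list(logits)
    if in_image:
        n_image_tokens = len(generated) - start - 1
        if n_image_tokens >= IMAGE_BLOCK_LEN:
            # Block complete: force the end-of-image token.
            allowed = {EOI}
        else:
            allowed = set(IMAGE_TOKENS)
        for tok in range(len(masked)):
            if tok not in allowed:
                masked[tok] = NEG_INF
    else:
        # Text mode: forbid raw image-codebook tokens and a stray EOI.
        for tok in list(IMAGE_TOKENS) + [EOI]:
            masked[tok] = NEG_INF
    return masked
```

Because the state is reconstructed from the generated ids on every step, no explicit mode variable has to be threaded through `generate()`; that is the main difference from the explicit FSM in Chameleon's repo, at the cost of the rescan that the benchmarking TODO below would quantify.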
TODOs:
- [ ] Add support for interleaved image-text generation with:
- [x] Chameleon -> https://github.com/huggingface/transformers/pull/32013
- [x] Anole -> https://github.com/huggingface/transformers/pull/32013
- [ ] Lumina-mGPT
- [ ] Transfusion
- [ ] XGen-MM
- [ ] Add auto model for image-text-to-image-text
- [ ] [Optional] Add auto model for image-to-image-text
- [ ] [Optional] Add auto model for text-to-image-text
- [ ] Add pipeline for image-text-to-image-text
- [ ] [Optional] Add pipeline for image-to-image-text
- [ ] [Optional] Add pipeline for text-to-image-text
- [ ] Benchmark different implementations of Logits Processors & FSMs for switching between text-generation and image-generation modes
Motivation
- To make benchmarking and evaluating models on interleaved image-text generation tasks saner
- To continue work on Multimodal In-and-Out, Interleaved Structured Generation: https://github.com/leloykun/mmsg
Your contribution
I've already started work on Chameleon & Anole here: https://github.com/huggingface/transformers/pull/32013
But I'm currently blocked by (1) not having enough time due to other responsibilities and (2) not having enough compute resources.
Any help would be appreciated!
FYI @NielsRogge and @merveenoyan, you've recently been discussing tags for these kinds of models on the Hub
@leloykun I saw your comment on issue #33905 (Implement LlamaGen for Image Generation).
I want to work on these issues; could you tell me where to begin? I am reading #31911, as you mentioned above.
@GargDivanshu You might also want to take a look at #32013
You can start by adding some of the missing tests and such to gain familiarity with the code there. And once you're ready, I can help you implement multimodal in-and-out for the other models.
perfect, moving to #32013
@zucchini-nlp I think this falls under any-to-any on the Hub side, but I'm not sure transformers should have a separate pipeline: we don't have many of these models as of now, and given the shift to any-to-any we would have to add yet another pipeline for models that can take audio input or output on top of the modalities here. @NielsRogge
Yes, I agree it should be any-to-any. I was just keeping you in the loop since some contributors are working on adding these types of models :)