
Add support for BLIP and GIT in image-to-text and VQA pipelines

Open · NielsRogge opened this issue 3 years ago · 20 comments

Feature request

BLIP and GIT are 2 recent additions in the library, providing state-of-the-art performance for tasks like image captioning and visual question answering (VQA). GIT is even capable of video captioning and video QA.

Hence it makes sense to support them in our image-to-text and VQA pipelines.

Motivation

Having support for better models in the pipelines is highly desirable!

See also a request for it here: https://discuss.huggingface.co/t/support-for-different-models-in-text-to-image-pipeline/29504

Your contribution

I can assist in adding support; see #18446 for a very similar case.

NielsRogge avatar Jan 13 '23 15:01 NielsRogge

Hi @NielsRogge, can I work on it?

atturaioe avatar Jan 13 '23 21:01 atturaioe

Sure @atturaioe feel free to start working on it

NielsRogge avatar Jan 16 '23 21:01 NielsRogge

I am writing to inquire about the possibility of starting work on this issue. @NielsRogge, can I contribute?

02shanks avatar Jan 17 '23 08:01 02shanks

@NielsRogge is this issue still open for contribution?

pikaduck avatar Feb 01 '23 05:02 pikaduck

Yes

NielsRogge avatar Feb 01 '23 07:02 NielsRogge

@NielsRogge If nobody is working on it, I would like to pick up the issue.

RaghavPrabhakar66 avatar Feb 08 '23 13:02 RaghavPrabhakar66

I would like to pick up the issue if it's still available.

strankid avatar Feb 16 '23 21:02 strankid

@NielsRogge, is this issue still open for contribution? I would like to work on it.

Tanmaypatil123 avatar Apr 17 '23 14:04 Tanmaypatil123

Support for BLIP in the image-to-text pipeline has been added in #21904. GIT can be added as explained in this comment, feel free to open a PR.

Support for the VQA pipeline still needs to be added for both models; contributions are welcome there as well.
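For reference, using it through the pipeline looks roughly like this (a minimal sketch; the checkpoint name is just an example):

from transformers import pipeline

# BLIP captioning via the image-to-text pipeline (checkpoint assumed for illustration)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("http://images.cocodataset.org/val2017/000000039769.jpg"))
# prints something like [{'generated_text': '...'}]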

NielsRogge avatar Apr 18 '23 12:04 NielsRogge

@NielsRogge, can I work on this issue?

sushmanthreddy avatar Apr 21 '23 19:04 sushmanthreddy

Hello @NielsRogge !

I would like to work on this issue (add support for VQA to GIT model) as a first contribution.

But before I start, I have a question:

Currently, the only model implementing the VQA pipeline is ViltForQuestionAnswering, and it treats the task as classification.

However, the GIT paper says:

For VQA, the input question is treated as a text prefix, and the answer is generated in an auto-regressive way. Furthermore, we present a new generation-based scheme for ImageNet classification, where the predicted labels come directly from our generative model without pre-defining the vocabulary.

So I wonder: should I implement it as a classifier, or should I follow the paper?

Thanks

marechaux avatar May 02 '23 14:05 marechaux

Hi @marechaux, we will need to implement the two different approaches in the VQA pipeline. ViLT and GIT indeed solve VQA in entirely different ways (ViLT is a classifier, whereas GIT is a generative GPT-like model).
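For reference, a rough sketch of the two approaches outside the pipeline (checkpoint names are examples, not a prescription):

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, ViltForQuestionAnswering, ViltProcessor

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
question = "How many cats are there?"

# ViLT: VQA as classification over a fixed answer vocabulary
vilt_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vilt = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
logits = vilt(**vilt_processor(image, question, return_tensors="pt")).logits
print(vilt.config.id2label[logits.argmax(-1).item()])

# GIT: VQA as autoregressive generation, with the question acting as a text prefix
git_processor = AutoProcessor.from_pretrained("microsoft/git-base-vqav2")
git = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vqav2")
pixel_values = git_processor(images=image, return_tensors="pt").pixel_values
input_ids = [git_processor.tokenizer.cls_token_id] + git_processor(text=question, add_special_tokens=False).input_ids
generated_ids = git.generate(pixel_values=pixel_values, input_ids=torch.tensor([input_ids]), max_length=50)
print(git_processor.batch_decode(generated_ids, skip_special_tokens=True))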

NielsRogge avatar May 03 '23 07:05 NielsRogge

Support for BLIP in the image-to-text pipeline has been added in #21904. GIT can be added as explained in this comment, feel free to open a PR.

Support for the VQA pipeline still needs to be added for both models; contributions are welcome there as well.

Hey @NielsRogge, took a shot at this. Am I correct in understanding that the ideal implementation of "microsoft/git-base" in the image-to-text pipeline would look something like this?

from transformers import AutoProcessor, GitForVision2Seq, pipeline

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = GitForVision2Seq.from_pretrained("microsoft/git-base")

pipe = pipeline("image-to-text", model=model, image_processor=processor.image_processor, tokenizer=processor.tokenizer)
print(pipe("https://www.wideopenpets.com/wp-content/uploads/sites/6/2021/12/Popular-Horse-Feature-Image.png"))

If so, I got this to work by:

  1. Adding the GitForVision2Seq class and making it available for imports / in MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES
  2. Updating src/transformers/models/git/processing_git.py to use a custom GITImageProcessor. This GITImageProcessor is an exact copy of the CLIPImageProcessor that GitProcessor already wraps, with the only difference being how the GITImageProcessor.preprocess method returns data when called by the ImageToTextPipeline.preprocess method (basically adding the input_ids key with a value of None).

So the GITImageProcessor.preprocess method ends with this:

data = {"pixel_values": images}
return_data = BatchFeature(data=data, tensor_type=return_tensors)
return_data["input_ids"] = None  # signal to the pipeline that no text prompt was provided
return return_data

rather than the CLIPImageProcessor.preprocess method, which returns this:

data = {"pixel_values": images} 
return BatchFeature(data=data, tensor_type=return_tensors) 

Curious to hear your thoughts on this approach. How would this affect other GIT image processing workflows (e.g. VQA)? Could we use a conditional to account for those?

skiss10 avatar May 09 '23 23:05 skiss10

Thanks for taking a stab at this. I'm fine with adding a GitForVision2Seq (as proposed by @Narsil); however, it'd be great not to have to add a custom GITImageProcessor. What's the reason this is added? Is it only to include "input_ids" set to None?

NielsRogge avatar May 10 '23 16:05 NielsRogge

Exactly this: 'only to include "input_ids" which are set to None'.

I see how adding an entirely new GITImageProcessor seems excessive when all it would do is add the input_ids: None key-value pair to the data returned from the .preprocess method.

As you describe in https://github.com/huggingface/transformers/issues/21514#issuecomment-1446359970, once we hit the preprocess method in ImageToTextPipeline and the model maps to GIT, the model_inputs are returned (via the CLIPImageProcessor through the GitProcessor in processing_git.py) without the input_ids key. So AFAIK, the best we can do is modify the return value of the CLIPImageProcessor.preprocess method without changing the CLIPImageProcessor class itself: replicate CLIPImageProcessor, rebrand it as GITImageProcessor, and modify its .preprocess method.

Let me know if that works or if you feel there is a better approach. Is the idea that there would be some way to do this within GitForVision2Seq?
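For what it's worth, another option that comes to mind (purely a hypothetical sketch, not what the pipeline currently does) would be to keep CLIPImageProcessor untouched and handle the missing input_ids with a conditional inside ImageToTextPipeline.preprocess, roughly like this:

# Hypothetical sketch only; names mirror the pipeline but the logic is illustrative
def preprocess(self, image, prompt=None):
    model_inputs = self.image_processor(images=image, return_tensors=self.framework)
    if self.model.config.model_type == "git" and prompt is None:
        # GIT generates unconditionally from pixel_values, so mark the absent text prompt explicitly
        model_inputs["input_ids"] = None
    return model_inputs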

As an aside, I read some best practices for working in the transformers library (https://huggingface.co/transformers/v4.10.1/add_new_model.html#general-overview-of-transformers). Would it be preferable to copy the entire CLIPImageProcessor class as GITImageProcessor within processing_git.py, or to do something more like the following?

from transformers import CLIPImageProcessor

class GITImageProcessor(CLIPImageProcessor):
    def preprocess(self, *args, **kwargs):
        # Call the original preprocess method
        return_data = super().preprocess(*args, **kwargs)
        
        # Add 'input_ids' key to the data
        return_data['input_ids'] = None

        return return_data

skiss10 avatar May 10 '23 18:05 skiss10

Hmm, I don't get why input_ids needs to be set to None. Could you clarify?

This example shows that you only need to pass pixel_values to the generate method to do image captioning.
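Roughly (a minimal captioning sketch, assuming the "microsoft/git-base-coco" checkpoint):

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# No input_ids needed: pixel_values alone is enough for unconditional captioning
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])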

NielsRogge avatar May 11 '23 20:05 NielsRogge

Hello, it seems that BLIP support for the image-to-text pipeline has been completed, but the VQA pipeline for both BLIP & GIT is not complete, and neither is the image-to-text pipeline for GIT. @marechaux, how is VQA for GIT going?

jprivera44 avatar May 20 '23 21:05 jprivera44

Hi! I'm also interested in helping out if we can divide the work :)

jucamohedano avatar May 27 '23 19:05 jucamohedano

Hey @NielsRogge, I was working on the VQA pipeline for BLIP, but I am confused about how to pass pixel_values to the _forward method in VisualQuestionAnsweringPipeline (src), because BLIP requires pixel values and those are generated by the preprocessor. Sorry if this is a silly question; this is my first open source contribution.
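To make the question concrete, my current (possibly wrong) understanding is that whatever dictionary preprocess returns is passed to _forward, so pixel_values would just need to be included there, roughly like this (illustrative names only, not the actual pipeline code):

# Illustrative sketch of my understanding, not the actual VisualQuestionAnsweringPipeline code
def preprocess(self, inputs):
    # The image processor produces pixel_values ...
    model_inputs = self.image_processor(images=inputs["image"], return_tensors="pt")
    # ... and the tokenizer contributes the question's input_ids
    model_inputs.update(self.tokenizer(inputs["question"], return_tensors="pt"))
    return model_inputs

def _forward(self, model_inputs):
    # Whatever preprocess returned, including pixel_values, arrives here as model_inputs
    return self.model.generate(**model_inputs)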

Tanmaypatil123 avatar Jun 28 '23 16:06 Tanmaypatil123

Hi @Tanmaypatil123, there's already this PR: #23348. Feel free to take it over / improve it.

NielsRogge avatar Jun 29 '23 17:06 NielsRogge

Hello, can I work on this?

PinakShome avatar Jul 03 '23 20:07 PinakShome

Hi Team, Can I start working on it ?

Vipul-Pandey-22 avatar Jul 05 '23 05:07 Vipul-Pandey-22

Hi @NielsRogge, I would like to try to add GIT support for VQA as my first contribution, is that OK? I looked at #23348, and I want to know whether it is fine to return the full generated text. I got it working locally, so I could prepare a PR if no one else is working on this.

I believe the input_ids, or their length, could be used in the postprocess step of VisualQuestionAnsweringPipeline to remove the prompt/prefix, like in TextGenerationPipeline, but it would require refactoring _forward in VisualQuestionAnsweringPipeline to also return the input_ids.

e.g. https://github.com/huggingface/transformers/blob/fe3c8ab1af558b95f67f5fafc0c55f09fd2b09db/src/transformers/pipelines/text_generation.py#L294-L305
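Concretely, a rough sketch of what I have in mind (the helper below and its arguments are hypothetical, not existing pipeline code):

import torch

# Hypothetical helper: strip the question prefix that GIT echoes back in its output
def postprocess_generative_vqa(generated_ids: torch.Tensor, input_ids: torch.Tensor, tokenizer, top_k: int = 1):
    # Drop the first input_ids.shape[1] tokens, i.e. the prompt/prefix
    answer_ids = generated_ids[:, input_ids.shape[1]:]
    answers = tokenizer.batch_decode(answer_ids, skip_special_tokens=True)
    return [{"answer": answer.strip()} for answer in answers[:top_k]]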

jpizarrom avatar Aug 14 '23 06:08 jpizarrom

Is this still open for contribution? Would love to help out.

astern21 avatar Nov 21 '23 01:11 astern21

Hi @astern21, I started a draft PR, but didn't get to finish it, and now I'm not really working on it.

jpizarrom avatar Nov 21 '23 07:11 jpizarrom