BLIP Image Captioning GradCAM?
Hi, I used BlipForConditionalGeneration from transformers for image captioning. I want to visualize the reasoning behind the generated caption (word by word), similar to GradCAM.
I found code from ALBEF (https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb), but it uses an image-text matching model, not an image captioning model.
Can you give me any hints or simple code for this?
Hi, you can look at our code in LAVIS, which provides a GradCAM computation function for the BLIP image-text matching model: https://github.com/salesforce/LAVIS/blob/a9939492f8f992d03088e7575bc711089b06544a/lavis/models/blip_models/blip_image_text_matching.py#L151
Does that mean only the image-text matching model can perform GradCAM? My model is an image captioning model (see https://huggingface.co/docs/transformers/model_doc/blip#transformers.BlipForConditionalGeneration).
If it only supports the image-text matching model, do I need to create a separate image-text matching model just for GradCAM?
You can adapt the GradCAM code to work with an image captioning model.
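In case it helps, here is a rough, untested sketch of how that adaptation might look with BlipForConditionalGeneration: generate a caption, re-run the decoder with teacher forcing and `output_attentions=True`, then backpropagate each token's logit into the cross-attention maps. The layer index, patch-grid size, and the assumption that the decoder exposes `cross_attentions` should all be checked against your transformers version.

```python
# Rough sketch (not tested end to end): per-token GradCAM for
# BlipForConditionalGeneration, adapted from the ALBEF/LAVIS ITM recipe.
# Assumptions to verify: the vision model prepends a [CLS] token to the
# patch embeddings, and the text decoder returns cross_attentions.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

image = Image.open("example.jpg").convert("RGB")  # any test image
inputs = processor(images=image, return_tensors="pt")

# 1. Generate a caption as usual.
caption_ids = model.generate(pixel_values=inputs["pixel_values"], max_length=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)

# 2. Re-run the decoder with teacher forcing so gradients can flow.
image_embeds = model.vision_model(pixel_values=inputs["pixel_values"])[0]
decoder_out = model.text_decoder(
    input_ids=caption_ids,
    encoder_hidden_states=image_embeds,
    output_attentions=True,
)

# cross_attentions: one tensor per layer, shape (batch, heads, text_len, image_len)
attn = decoder_out.cross_attentions[-1]      # which layer to visualize is a knob to tune

num_patches = image_embeds.shape[1] - 1      # drop the image [CLS] token
grid = int(num_patches ** 0.5)               # e.g. 24x24 for 384px inputs

gradcams = []
for t in range(1, caption_ids.shape[1] - 1):          # skip BOS and the final token
    token_id = caption_ids[0, t]
    score = decoder_out.logits[0, t - 1, token_id]    # logit that produced token t
    grad = torch.autograd.grad(score, attn, retain_graph=True)[0]
    # GradCAM: attention weighted by positive gradients, averaged over heads
    cam = (attn[0, :, t - 1, 1:] * grad[0, :, t - 1, 1:].clamp(min=0)).mean(0)
    gradcams.append(cam.reshape(grid, grid).detach())

print(caption)
# gradcams[i] is a patch-level heatmap for the (i+1)-th generated token;
# upsample it (e.g. torch.nn.functional.interpolate) to overlay on the image.
```

This avoids the hook-based `save_attention_map` / `save_attn_gradients` machinery in LAVIS by taking gradients directly through the attention tensors returned with `output_attentions=True`, which is usually simpler to wire up outside the original codebase.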
Thank you, I will try it.
Hi, I am also working on visualization beyond the image-text matching model, and I've run into difficulties when calling `attn_gradients` and `attention_map`. Have you had any success with this, and if so, could you share the code or provide some guidance? Thank you very much!
Sure, if I solve it, I will let you know.
Did you manage to solve this?