
BLIP Image Captioning GradCAM?

Open gwyong opened this issue 2 years ago • 8 comments

Hi, I used `BlipForConditionalGeneration` from transformers for image captioning. I want to visualize the evidence behind the generated caption, word by word, similar to Grad-CAM.

I found code from ALBEF (https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb), but it uses an image-text matching model, not an image captioning model.

Can you give me any hints or a simple code example for this?

gwyong avatar May 22 '23 05:05 gwyong

Hi, you can look at our code in LAVIS, which provides a Grad-CAM computation function for the BLIP image-text matching model: https://github.com/salesforce/LAVIS/blob/a9939492f8f992d03088e7575bc711089b06544a/lavis/models/blip_models/blip_image_text_matching.py#L151

LiJunnan1992 avatar May 23 '23 02:05 LiJunnan1992

Does that mean only an image-text matching model can produce Grad-CAM visualizations? My model is an image captioning model (see https://huggingface.co/docs/transformers/model_doc/blip#transformers.BlipForConditionalGeneration).

If Grad-CAM only supports the image-text matching model, do I need to build a separate image-text matching model just for the visualization?

gwyong avatar May 23 '23 02:05 gwyong

You can adapt the gradcam code to work with an image captioning model.
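One way to read "adapt the gradcam code" is: instead of backpropagating an image-text matching score, backpropagate the score of a single generated token and weight the decoder's cross-attention by its positive gradients. Below is a minimal, hedged sketch of that idea on a toy cross-attention layer. All names here (`ToyCrossAttention`, the saved `attn_map`/`attn_grad` attributes) are hypothetical stand-ins, not the real BLIP module paths.

```python
# Hedged sketch of per-token Grad-CAM for captioning, using a toy
# cross-attention layer. All names are hypothetical; the real BLIP
# decoder layers are structured differently.
import torch

class ToyCrossAttention(torch.nn.Module):
    """Stand-in for one decoder cross-attention layer over image patches."""
    def __init__(self, dim=8):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.attn_map = None    # saved on forward
        self.attn_grad = None   # saved on backward

    def forward(self, text_h, image_h):
        scores = self.q(text_h) @ self.k(image_h).T
        attn = scores.softmax(dim=-1)   # (num_tokens, num_patches)
        self.attn_map = attn
        # tensor hook fires during backward with d(loss)/d(attn)
        attn.register_hook(lambda g: setattr(self, "attn_grad", g))
        return attn @ image_h

layer = ToyCrossAttention()
text_h = torch.randn(5, 8)    # hidden states of 5 generated tokens
image_h = torch.randn(16, 8)  # 16 image patches (a 4x4 grid)

out = layer(text_h, image_h)
token_idx = 2                    # pick one caption word to explain
out[token_idx].sum().backward()  # stand-in for that token's logit/score

# Grad-CAM: attention weighted by its (positive) gradient
cam = (layer.attn_map * layer.attn_grad.clamp(min=0))[token_idx]
heatmap = cam.reshape(4, 4).detach()  # map back onto the patch grid
```

With the real model, the scalar you backpropagate would be the log-probability of the generated token at that decoding step, and the attention would come from a chosen cross-attention layer, averaged over heads, as the ALBEF notebook does for the matching score.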

LiJunnan1992 avatar May 23 '23 03:05 LiJunnan1992

Thank you, I will try it.

gwyong avatar May 23 '23 04:05 gwyong

Hi, I am also working on visualization beyond the image-text matching model, and I've run into difficulties retrieving `attn_gradients` and `attention_map`. Have you had any success with this, and if so, could you share the code or provide some guidance? Thank you very much!
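A common stumbling block here: in the ALBEF/LAVIS code, `attention_map` is stored by a hook during the forward pass, while `attn_gradients` is stored by a tensor hook and only exists after a backward pass reaches that tensor. A minimal sketch of the same mechanism on a toy module (the names `TinyAttnBlock`, `grab`, and the `store` keys are mine, not the library's):

```python
# Hedged sketch of how attention_map / attn_gradients get populated:
# a forward hook stores the softmax output, and a tensor hook on that
# same output stores its gradient during backward. Names are hypothetical.
import torch

class TinyAttnBlock(torch.nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.softmax = torch.nn.Softmax(dim=-1)  # hook target

    def forward(self, text_h, image_h):
        attn = self.softmax(self.q(text_h) @ self.k(image_h).T)
        return attn @ image_h

store = {}

def grab(name):
    def hook(module, inputs, output):
        store[name + ".attention_map"] = output.detach()
        # only fires if a backward pass actually reaches this tensor
        output.register_hook(
            lambda g: store.__setitem__(name + ".attn_gradients", g.detach()))
    return hook

model = TinyAttnBlock()
handle = model.softmax.register_forward_hook(grab("block0"))

out = model(torch.randn(5, 8), torch.randn(16, 8))
out.sum().backward()   # without this, attn_gradients is never saved
handle.remove()
```

If `attn_gradients` comes back empty in your setup, the usual causes are that no backward pass was run, that attention saving was never enabled before the forward pass, or that the hooked tensor is not on the path of the loss you backpropagate.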

Michi-3000 avatar May 23 '23 16:05 Michi-3000

Sure, if I solve it I will let you know.

gwyong avatar May 23 '23 19:05 gwyong

> Sure if I solve it, I will let you know.

Did you manage to solve this?

dip9811111 avatar Oct 03 '23 09:10 dip9811111