BLIP Image Captioning GradCAM?
Hi, I used BlipForConditionalGeneration from transformers for image captioning. I want to visualize the reasoning behind the generated caption (word by word), similar to GradCAM.
I found code from ALBEF (https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb), but it uses an image-text matching model, not an image captioning model.
Can you give me any hints or simple code for this?
Hi, you can look at our code in LAVIS, which provides a GradCAM computation function for the BLIP image-text matching model: https://github.com/salesforce/LAVIS/blob/a9939492f8f992d03088e7575bc711089b06544a/lavis/models/blip_models/blip_image_text_matching.py#L151
Does that mean only the image-text matching model can perform GradCAM? My model is an image captioning model (see https://huggingface.co/docs/transformers/model_doc/blip#transformers.BlipForConditionalGeneration).
If it only supports the image-text matching model, do I need to create a separate image-text matching model just for GradCAM?
You can adapt the GradCAM code to work with an image captioning model.
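In case it helps, here is a rough, untested sketch of how that adaptation might look with BlipForConditionalGeneration: generate a caption, re-run the decoder with teacher forcing and `output_attentions=True`, then backpropagate each token's logit into the cross-attention maps. The layer index, patch-grid size, and the assumption that the decoder exposes `cross_attentions` should all be checked against your transformers version.

```python
# Rough sketch (not tested end to end): per-token GradCAM for
# BlipForConditionalGeneration, adapted from the ALBEF/LAVIS ITM recipe.
# Assumptions to verify: the vision model prepends a [CLS] token to the
# patch embeddings, and the text decoder returns cross_attentions.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

image = Image.open("example.jpg").convert("RGB")  # any test image
inputs = processor(images=image, return_tensors="pt")

# 1. Generate a caption as usual.
caption_ids = model.generate(pixel_values=inputs["pixel_values"], max_length=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)

# 2. Re-run the decoder with teacher forcing so gradients can flow.
image_embeds = model.vision_model(pixel_values=inputs["pixel_values"])[0]
decoder_out = model.text_decoder(
    input_ids=caption_ids,
    encoder_hidden_states=image_embeds,
    output_attentions=True,
)

# cross_attentions: one tensor per layer, shape (batch, heads, text_len, image_len)
attn = decoder_out.cross_attentions[-1]      # which layer to visualize is a knob to tune

num_patches = image_embeds.shape[1] - 1      # drop the image [CLS] token
grid = int(num_patches ** 0.5)               # e.g. 24x24 for 384px inputs

gradcams = []
for t in range(1, caption_ids.shape[1] - 1):          # skip BOS and the final token
    token_id = caption_ids[0, t]
    score = decoder_out.logits[0, t - 1, token_id]    # logit that produced token t
    grad = torch.autograd.grad(score, attn, retain_graph=True)[0]
    # GradCAM: attention weighted by positive gradients, averaged over heads
    cam = (attn[0, :, t - 1, 1:] * grad[0, :, t - 1, 1:].clamp(min=0)).mean(0)
    gradcams.append(cam.reshape(grid, grid).detach())

print(caption)
# gradcams[i] is a patch-level heatmap for the (i+1)-th generated token;
# upsample it (e.g. torch.nn.functional.interpolate) to overlay on the image.
```

This avoids the hook-based `save_attention_map` / `save_attn_gradients` machinery in LAVIS by taking gradients directly through the attention tensors returned with `output_attentions=True`, which is usually simpler to wire up outside the original codebase.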
Thank you, I will try it.
Hi, I am also working on visualization beyond the image-text matching model, and I've run into difficulties when calling `attn_gradients` and `attention_map`. Have you had any success with this, and if so, could you share the code or provide some guidance? Thank you very much!
Sure, if I solve it, I will let you know.
Did you manage to solve this?