Error when running text localization example
The issue is about the text localization example.
The input image is `../docs/_static/merlion.png`, and the input caption is changed to: "Merlion near marina bay. It is a city in Singapore. It is a very beautiful city located in Asia. It attract a lot of tourists to come at all seasons. There is a famous hotel in the picture. The picture is capture in night time."
Below is the error message:
```
gradcam, _ = compute_gradcam(model, img, txt, txt_tokens, block_num=7)
File "/data/code/LAVIS/lavis/models/blip_models/blip_image_text_matching.py", line 147, in compute_gradcam
    cams = cams[:, :, :, 1:].reshape(visual_input.size(0), 12, -1, 24, 24) * mask
RuntimeError: The size of tensor a (35) must match the size of tensor b (48) at non-singleton dimension 2
```
Can you elaborate on how to fix this error?
Hi @yi-ming-qian , thanks for your interest.
Can you confirm you can run with the original caption with no issue?
The BLIP ITM model processes at most 35 text tokens by default, so longer inputs get truncated. You can either shorten the caption or increase `max_txt_len`.
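One way to shorten the caption is to trim it before tokenization. A minimal sketch, assuming the 35-token default mentioned above; note that BLIP uses subword tokenization, so whitespace word counting is only an approximation and trimming a few words below the limit leaves a safety margin (the helper name `truncate_caption` is hypothetical, not part of LAVIS):

```python
def truncate_caption(caption: str, max_tokens: int = 35) -> str:
    """Approximate truncation: keep at most max_tokens whitespace-separated words.

    Real BLIP tokenization is subword-based, so the actual token count
    can exceed the word count; pick max_tokens a bit below the model limit.
    """
    words = caption.split()
    return " ".join(words[:max_tokens])

caption = (
    "Merlion near marina bay. It is a city in Singapore. It is a very "
    "beautiful city located in Asia. It attract a lot of tourists to come "
    "at all seasons. There is a famous hotel in the picture. The picture "
    "is capture in night time."
)
short_caption = truncate_caption(caption, max_tokens=30)
```

Passing `short_caption` (instead of the full caption) to the text processor and `compute_gradcam` keeps the token count within the model's default budget, avoiding the shape mismatch in the reshape call.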