Junnan Li
Hi, you can refer to the code here for the data loading of text-video QA: https://github.com/salesforce/ALPRO. Thanks!
We use the VQA model to generate answers: https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_vqa.py#L85 To handle videos, we simply concatenate frame features and pass them to the text decoder.
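A rough sketch of that frame-concatenation idea (the `visual_encoder` call mirrors how BLIP encodes images, but the variable names and shapes here are illustrative, not the exact repo code):

```python
import torch

# frames: (batch, num_frames, 3, H, W) -- frames sampled from each video clip
b, t = frames.shape[:2]

# Encode every frame independently with the image encoder (ViT), then
# concatenate the per-frame patch features along the sequence dimension.
frame_feats = visual_encoder(frames.flatten(0, 1))               # (b*t, num_patches, dim)
video_feats = frame_feats.view(b, t * frame_feats.size(1), -1)   # (b, t*num_patches, dim)
video_atts = torch.ones(video_feats.shape[:-1], dtype=torch.long,
                        device=video_feats.device)

# video_feats / video_atts are then passed as the cross-attention context
# to the text decoder, in the same way image features are in blip_vqa.py.
```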
Hi, my implementation of ViT is based on the timm codebase. You might want to try the pretrained weights from timm.
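For example, one way to pull the pretrained ViT weights from timm (the exact model name and any key remapping into this repo's ViT are assumptions you may need to adapt):

```python
import timm

# Load a pretrained ViT-B/16 from timm; BLIP's ViT follows the same architecture.
vit = timm.create_model('vit_base_patch16_224', pretrained=True)
state_dict = vit.state_dict()

# Depending on your input resolution, the position embeddings may need to be
# interpolated before loading, and strict=False tolerates head/key mismatches.
# msg = model.visual_encoder.load_state_dict(state_dict, strict=False)
```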
Hi, we are currently working on a demo for retrieval.
Hi, it could be related to the dataloader.
You can encode the image once and then repeat the image embedding along the batch dimension.
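A minimal sketch, assuming a single image paired with `num_texts` text inputs (the `model.visual_encoder` call follows how BLIP encodes images; `num_texts` is a placeholder for your batch of questions or candidate answers):

```python
import torch

# Encode the image once...
image_embeds = model.visual_encoder(image)            # (1, num_patches, dim)
image_atts = torch.ones(image_embeds.shape[:-1], dtype=torch.long,
                        device=image.device)

# ...then tile the encoding along the batch dimension so it can be paired
# with many text inputs without re-running the image encoder.
num_texts = 16
image_embeds = image_embeds.repeat_interleave(num_texts, dim=0)  # (num_texts, num_patches, dim)
image_atts = image_atts.repeat_interleave(num_texts, dim=0)
```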
Hi, note that the multimodal feature has not been optimized for cosine similarity. The unimodal features can be used to compute cosine similarity because they are trained with the image-text contrastive loss.
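As a rough sketch of image-text similarity with the unimodal features, following the feature-extractor interface used in the demo (the normalization via `cosine_similarity` is added here and is not part of the demo snippet):

```python
import torch.nn.functional as F

# Unimodal [CLS] features from the image encoder and the text encoder.
image_feat = model(image, caption, mode='image')[0, 0]   # (dim,)
text_feat = model(image, caption, mode='text')[0, 0]     # (dim,)

# These unimodal features are trained with the image-text contrastive loss,
# so cosine similarity between them is meaningful (unlike the multimodal feature).
sim = F.cosine_similarity(image_feat, text_feat, dim=0)
```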
You can compute the cosine similarity of their image embeddings.
Please refer to this code in the demo: `image_feature = model(image, caption, mode='image')[0,0]`
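Putting the two replies together, a minimal sketch for comparing two images (only the demo line above is from the repo; the `cosine_similarity` call is added here for illustration):

```python
import torch.nn.functional as F

# Unimodal [CLS] feature for each image, as in the demo snippet above.
feat1 = model(image1, caption, mode='image')[0, 0]
feat2 = model(image2, caption, mode='image')[0, 0]

# Cosine similarity between the two image embeddings.
sim = F.cosine_similarity(feat1, feat2, dim=0)
```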