Junnan Li
Hi, you can refer to the code here for the data loading of text-video QA: https://github.com/salesforce/ALPRO. Thanks!
We use the VQA model to generate answers: https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_vqa.py#L85 To handle videos, we simply concatenate frame features and pass them to the text decoder.
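A rough sketch of that frame-concatenation idea (the `visual_encoder` call mirrors how BLIP encodes images, but the variable names and shapes here are illustrative, not the exact repo code):

```python
import torch

# frames: (batch, num_frames, 3, H, W) -- frames sampled from each video clip
b, t = frames.shape[:2]

# Encode every frame independently with the image encoder (ViT), then
# concatenate the per-frame patch features along the sequence dimension.
frame_feats = visual_encoder(frames.flatten(0, 1))               # (b*t, num_patches, dim)
video_feats = frame_feats.view(b, t * frame_feats.size(1), -1)   # (b, t*num_patches, dim)
video_atts = torch.ones(video_feats.shape[:-1], dtype=torch.long,
                        device=video_feats.device)

# video_feats / video_atts are then passed as the cross-attention context
# to the text decoder, in the same way image features are in blip_vqa.py.
```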
Hi, my implementation of ViT is based on the timm codebase. You might want to try the pretrained weights from timm.
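For example, one way to pull the pretrained ViT weights from timm (the exact model name and any key remapping into this repo's ViT are assumptions you may need to adapt):

```python
import timm

# Load a pretrained ViT-B/16 from timm; BLIP's ViT follows the same architecture.
vit = timm.create_model('vit_base_patch16_224', pretrained=True)
state_dict = vit.state_dict()

# Depending on your input resolution, the position embeddings may need to be
# interpolated before loading, and strict=False tolerates head/key mismatches.
# msg = model.visual_encoder.load_state_dict(state_dict, strict=False)
```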
Hi, we are currently working on a demo for retrieval.
Hi, it could be related to the dataloader.
You can encode the image once and then repeat the image embedding along the batch dimension.
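A minimal sketch, assuming a single image paired with `num_texts` text inputs (the `model.visual_encoder` call follows how BLIP encodes images; `num_texts` is a placeholder for your batch of questions or candidate answers):

```python
import torch

# Encode the image once...
image_embeds = model.visual_encoder(image)            # (1, num_patches, dim)
image_atts = torch.ones(image_embeds.shape[:-1], dtype=torch.long,
                        device=image.device)

# ...then tile the encoding along the batch dimension so it can be paired
# with many text inputs without re-running the image encoder.
num_texts = 16
image_embeds = image_embeds.repeat_interleave(num_texts, dim=0)  # (num_texts, num_patches, dim)
image_atts = image_atts.repeat_interleave(num_texts, dim=0)
```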
Hi, note that the multimodal feature has not been optimized for cosine similarity. The unimodal features can be used to compute cosine similarity because they are trained with the image-text contrastive loss.
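As a rough sketch of image-text similarity with the unimodal features, following the feature-extractor interface used in the demo (the normalization via `cosine_similarity` is added here and is not part of the demo snippet):

```python
import torch.nn.functional as F

# Unimodal [CLS] features from the image encoder and the text encoder.
image_feat = model(image, caption, mode='image')[0, 0]   # (dim,)
text_feat = model(image, caption, mode='text')[0, 0]     # (dim,)

# These unimodal features are trained with the image-text contrastive loss,
# so cosine similarity between them is meaningful (unlike the multimodal feature).
sim = F.cosine_similarity(image_feat, text_feat, dim=0)
```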
You can compute the cosine similarity of their image embeddings.
Please refer to this code in the demo: `image_feature = model(image, caption, mode='image')[0,0]`
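Putting the two replies together, a minimal sketch for comparing two images (only the demo line above is from the repo; the `cosine_similarity` call is added here for illustration):

```python
import torch.nn.functional as F

# Unimodal [CLS] feature for each image, as in the demo snippet above.
feat1 = model(image1, caption, mode='image')[0, 0]
feat2 = model(image2, caption, mode='image')[0, 0]

# Cosine similarity between the two image embeddings.
sim = F.cosine_similarity(feat1, feat2, dim=0)
```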