Some doubts about the absolute value of ViCLIP similarity
Thanks for such beautiful work! In the past, the similarity between a video and a text was usually computed by calculating the similarity between each frame and the text with an image-text CLIP and then averaging over frames. When the text and video are aligned, the value computed this way is usually above 0.9. However, the value computed with ViCLIP is only about 0.3. Could you explain the reason?
Yeah, I want to ask this too. I'm not sure how we should threshold out good matches, given that the cosine similarities almost always lie between 0.2 and 0.4. I can see that softmax(100 * score) is used to get the relative closeness among a set of candidates, but this doesn't help to exclude unmatched candidates.
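For context, here is a minimal sketch (not the InternVideo code) of the two patterns discussed above: per-frame image-text CLIP similarity averaged over frames, and softmax(100 * score) over a set of candidate texts. The model name, dummy frames, and captions are placeholders for illustration only.

```python
# Sketch: per-frame CLIP similarity averaged over frames, plus softmax(100 * sim)
# for relative closeness among candidates. Swap the dummy frames/texts for your data.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.new("RGB", (224, 224)) for _ in range(8)]     # stand-in for sampled video frames
texts = ["a dog running on grass", "a person cooking pasta"]  # candidate captions

with torch.no_grad():
    img_in = processor(images=frames, return_tensors="pt")
    txt_in = processor(text=texts, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(**img_in)              # (num_frames, dim)
    txt_emb = model.get_text_features(**txt_in)               # (num_texts, dim)

# L2-normalize so the dot product is a cosine similarity
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

sims = img_emb @ txt_emb.T                                    # (num_frames, num_texts)
video_text_sim = sims.mean(dim=0)                             # average over frames -> one score per text

# Relative closeness among the candidates, as in softmax(100 * score)
probs = torch.softmax(100 * video_text_sim, dim=-1)
print(video_text_sim.tolist(), probs.tolist())
```

Note that the softmax only ranks candidates relative to each other; it cannot tell you whether the best candidate is a genuine match, which is exactly the thresholding problem raised here.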
Hi folks, I'm encountering the same issue and am wondering whether you figured this out or found a way to threshold out irrelevant candidates?
Apologies for the direct mentions @shepnerd @leexinhao, but I think once this is clear, the InternVideo models would see far wider adoption in the community, so I'm wondering whether you could help us with this. Thanks in advance!
Hi there, I'm facing the same issue: the similarity scores for text-visual sample matching consistently fall within the range of 0.2 to 0.3. Has anyone managed to solve this problem?
@leemengtw @zmy1116 @LiuHuijie6410 @lq826311756 Sorry for replying so late.
- If you use InternVideo2-clip: since we train with contrastive learning, which does not supervise the absolute value of the similarity, I think a reasonably good threshold can only be determined by trying several values on your own task (see the sketch after this list).
- If you use InternVideo2-stage2, you can use the match head to get a match score between 0 and 1: https://github.com/OpenGVLab/InternVideo/blob/4b0b701512dcc4de6c27cabaff93b270bc14c14d/InternVideo2/multi_modality/tasks/retrieval_utils.py#L1107.
- I also found that in MLLM-based embedding models, such as our recent work CaRe, this phenomenon is alleviated. You can give it a try.
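As a rough illustration of the "determine a threshold through multiple attempts" suggestion in the first bullet, one simple recipe is to collect a small validation set of similarity scores with known match/non-match labels, sweep candidate thresholds, and keep the one with the best F1. The `val_sims` and `val_labels` arrays below are hypothetical placeholders for your own data.

```python
# Sketch: pick a cosine-similarity threshold by sweeping values on a labeled
# validation set and keeping the one that maximizes F1. Data below is made up.
import numpy as np

val_sims = np.array([0.31, 0.28, 0.22, 0.35, 0.24, 0.20])  # similarities from the model
val_labels = np.array([1, 1, 0, 1, 0, 0])                   # 1 = text and video actually match

def f1_at(th):
    pred = val_sims >= th
    tp = np.sum(pred & (val_labels == 1))
    fp = np.sum(pred & (val_labels == 0))
    fn = np.sum(~pred & (val_labels == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

thresholds = np.linspace(val_sims.min(), val_sims.max(), 50)
best = max(thresholds, key=f1_at)
print(f"best threshold ~ {best:.3f}, F1 = {f1_at(best):.3f}")
```

Because the contrastive loss never constrains the absolute similarity values, the chosen threshold will be specific to your model checkpoint and data distribution, so it should be re-tuned if either changes.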