
Some doubts about the absolute value of ViCLIP similarity

Open LiuHuijie6410 opened this issue 1 year ago • 4 comments

Thanks for such beautiful work! In the past, the similarity between a video and a text was usually computed by calculating the similarity between each frame and the text with an image-text CLIP model and then taking the average. If the text and video are aligned, the value calculated this way is usually above 0.9. However, the value calculated using ViCLIP is only about 0.3. Could you explain the reason?
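For context, a minimal sketch of the frame-averaging approach described above (using the Hugging Face CLIP API as an example; the model name and helper are illustrative, not from this repo):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_averaged_similarity(frames, text):
    """frames: list of PIL images sampled from a video; text: a single caption."""
    inputs = processor(text=[text], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # L2-normalize the projected embeddings, then take cosine similarity per frame
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    per_frame = (image_emb @ text_emb.T).squeeze(-1)
    return per_frame.mean().item()
```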

LiuHuijie6410 avatar Sep 09 '24 09:09 LiuHuijie6410

Yeah, I want to ask this too. I'm not sure how we should threshold out good matches, given that the cosine similarities almost always lie between 0.2 and 0.4. I can see that softmax(100 * score) is used to get relative closeness among a set of candidates, but this doesn't help to exclude unmatched candidates.
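To illustrate the point (a small sketch, not code from the repo): softmax(100 * score) only ranks candidates relative to each other, so the top candidate gets a high probability even when all of its absolute similarities are low.

```python
import torch

# raw cosine similarities for 3 candidate texts, all in the typical 0.2-0.4 range
similarities = torch.tensor([0.31, 0.28, 0.24])
probs = torch.softmax(100 * similarities, dim=0)  # sharp relative ranking, sums to 1
print(probs)  # the best candidate dominates even though no candidate may truly match
```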

zmy1116 avatar Sep 20 '24 19:09 zmy1116

Hi folks, I'm encountering the same issue and am wondering whether you figured this out or found a way to threshold out irrelevant candidates.

Apologies for the direct mentions @shepnerd @leexinhao, but I think that once this is clear, the InternVideo models would get far more adoption from the community, so I'm wondering whether you can help us with this. Thanks in advance!

leemengtw avatar Dec 16 '24 10:12 leemengtw

Hi there, I'm facing the same issue: the similarity scores for text-visual sample matching consistently fall within the range of 0.2 to 0.3. Has anyone managed to solve this problem?

lq826311756 avatar Jun 23 '25 13:06 lq826311756

@leemengtw @zmy1116 @LiuHuijie6410 @lq826311756 Sorry for replying so late.

  1. If you use InternVideo2-clip: since we train with contrastive learning, which does not supervise the absolute value of the similarity, I think a reasonable threshold can only be found empirically, by trying several values on your task.
  2. If you use InternVideo2-stage2: you can use the match head to get a matching score between 0 and 1, see https://github.com/OpenGVLab/InternVideo/blob/4b0b701512dcc4de6c27cabaff93b270bc14c14d/InternVideo2/multi_modality/tasks/retrieval_utils.py#L1107 (a rough sketch of such a match head follows this list).
  3. I also found that in MLLM-based embedding models, such as our recent work CaRe, this phenomenon is alleviated. You can give it a try.
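For point 2, here is a rough, hypothetical sketch of what an image/video-text matching (ITM) head looks like; the names below are illustrative and not the actual InternVideo2 API, which lives in the retrieval_utils.py link above:

```python
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    """Binary matching head: fused video-text feature -> probability of 'match'."""
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)  # two logits: [no-match, match]

    def forward(self, fused_feat):
        # fused_feat: cross-attended video-text feature, shape [batch, hidden_dim]
        logits = self.fc(fused_feat)
        # softmax over the two logits gives a score in [0, 1], easier to threshold
        return torch.softmax(logits, dim=-1)[:, 1]
```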

leexinhao avatar Jun 30 '25 06:06 leexinhao