Empirical Min
Thank you for your great work. I apologize for raising another issue, but this one has been troubling me.
When using your code to calculate Overall Consistency, specifically the cosine similarity between ViCLIP visual features and text features (line 49 of vbench/overall_consistency.py), we are getting negative results.
Cosine similarity can indeed be less than 0, so how do you ensure that the lower bound is exactly zero (line 55 of scripts/constant.py)? Could you provide some insight on this?
Looking forward to your response.
@ziqihuangg
Just following up on my question. I'd appreciate any guidance when you have time. @ziqihuangg
@ziqihuangg
The lower bounds we refer to are "empirical" bounds, not "theoretical" ones; they are used only to normalize the final score. We have never claimed that the theoretical lower bound is zero; for cosine similarity it is -1. A ViCLIP score below zero indicates that the video content and the text description have "opposite semantic meaning." A positive score suggests related semantics, while a score near zero typically indicates unrelated but non-contradictory content. In text-to-video generation, it is highly unusual for a video to exhibit "opposite semantic meaning" to its text prompt. Even in the rare cases where this occurs for individual samples, the average score across all samples would most likely still be above zero.
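To make the distinction concrete, here is a minimal sketch of how an empirical minimum could be used for normalization. It assumes a plain min-max rescaling of the averaged score; the function names, the sample scores, and the `emp_max` value are hypothetical and are not taken from the VBench code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; the theoretical range is [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize_score(raw, emp_min, emp_max):
    """Min-max rescaling with empirical bounds (assumed form, not VBench's code)."""
    return (raw - emp_min) / (emp_max - emp_min)


# An individual pair can score below zero: opposite vectors give -1.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

# Hypothetical per-video scores, including one slightly negative sample.
per_video = [0.25, 0.31, -0.02, 0.18]
mean_score = sum(per_video) / len(per_video)

# The empirical bounds are applied to the average, not to each sample.
# emp_max=0.4 is an illustrative value only.
normalized = normalize_score(mean_score, emp_min=0.0, emp_max=0.4)
print(opposite, mean_score, normalized)
```

In this sketch a single negative sample lowers the mean slightly, but the mean itself stays above the empirical minimum of zero, so the normalized score remains non-negative.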