InternVideo
Method of running evaluation on MSR-VTT dataset
Thanks for the paper and for open-sourcing the code base.
I would like to know how evaluation is performed on the MSR-VTT dataset for zero-shot text-to-video retrieval.
- Are the reported MSR-VTT metrics computed on the entire test split (~2,990 videos / 59,800 captions) or on the 1k-A subset (~1,000 videos / 20,000 captions)?
- Is each of the 20 captions per video used as a query when computing the recall metrics? (See the sketch after this list.)
- As the captions are not very descriptive and similar videos/captions exist, how are such retrieval "errors" accounted for or adjusted? For example, one of the captions for `video7960` is "a band performing in a small club", but `video8978` fits the same profile. Another caption for `video7960` is "a group of boys and girls are dancing", but `video9957` could also be considered correct if retrieved. I will be happy to provide more such examples.
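To make the second question concrete, here is a minimal sketch of what I mean by using every caption as a query, assuming text and video embeddings have already been extracted with the model (the function and variable names are illustrative, not InternVideo's actual API):

```python
import torch

def recall_at_k(sims, gt_video_ids, ks=(1, 5, 10)):
    """sims: (num_captions, num_videos) caption-to-video similarity matrix;
    gt_video_ids[i]: index of the source video of caption i."""
    # Rank of the ground-truth video for each caption (0 = retrieved first).
    order = sims.argsort(dim=-1, descending=True)
    ranks = (order == gt_video_ids[:, None]).nonzero()[:, 1]
    return {f"R@{k}": (ranks < k).float().mean().item() * 100 for k in ks}

# Toy stand-in for the full test split, where all 20 captions of a video
# share the same ground-truth video (real scale: ~2,990 videos,
# ~59,800 captions).
num_videos, caps_per_video = 50, 20
sims = torch.randn(num_videos * caps_per_video, num_videos)
gt = torch.arange(num_videos).repeat_interleave(caps_per_video)
print(recall_at_k(sims, gt))
```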
Looking forward to your clarification. Thanks!
Hi! In the latest version, we follow Unmasked Teacher to conduct the evaluation. Please check the code and metadata~
For Q1, we use the 1k-A subset for testing. For Q2, only one caption is used for each video.
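For reference, under this 1k-A protocol the computation reduces to ranking a 1,000 × 1,000 caption-to-video similarity matrix whose ground truth is the diagonal. A minimal self-contained sketch, with random tensors standing in for the extracted features (not the repository's actual evaluation code):

```python
import torch
import torch.nn.functional as F

# Stand-ins for features extracted by the model: 1,000 videos, one caption each.
text_embeds = torch.randn(1000, 512)
video_embeds = torch.randn(1000, 512)

# Cosine similarity between every caption and every video.
sims = F.normalize(text_embeds, dim=-1) @ F.normalize(video_embeds, dim=-1).T

# Caption i's ground-truth video is index i, i.e. the diagonal of `sims`.
order = sims.argsort(dim=-1, descending=True)
ranks = (order == torch.arange(1000)[:, None]).nonzero()[:, 1]
for k in (1, 5, 10):
    print(f"R@{k}: {(ranks < k).float().mean().item() * 100:.1f}%")
```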
Thanks @Andy1621. I will look at the link you pointed to and get back to you if I have any doubts.