InternVideo
Method of running evaluation on MSR-VTT dataset
Thanks for the paper and for open-sourcing the code base.
I would like to know how evaluation is performed on the MSR-VTT dataset for zero-shot text-to-video retrieval.
- Are the reported MSR-VTT metrics computed on the entire test split (~2,990 videos / 59,800 captions) or on the 1k-A subset (~1,000 videos / 20,000 captions)?
- Is each of the 20 captions per video used as a query when computing the recall metrics? (See the sketch after this list.)
- As the captions are not very descriptive and similar videos/captions exist, how are such retrieval "errors" accounted for or adjusted? For example, one of the captions for `video7960` is "a band performing in a small club", but `video8978` fits the same profile. Another caption for `video7960` is "a group of boys and girls are dancing", but `video9957` could also be considered correct if retrieved. I will be happy to provide more such examples.
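To make the second question concrete, here is a minimal sketch of what I mean by using every caption as a query, assuming text and video embeddings have already been extracted with the model (the function and variable names are illustrative, not InternVideo's actual API):

```python
import torch

def recall_at_k(sims, gt_video_ids, ks=(1, 5, 10)):
    """sims: (num_captions, num_videos) caption-to-video similarity matrix;
    gt_video_ids[i]: index of the source video of caption i."""
    # Rank of the ground-truth video for each caption (0 = retrieved first).
    order = sims.argsort(dim=-1, descending=True)
    ranks = (order == gt_video_ids[:, None]).nonzero()[:, 1]
    return {f"R@{k}": (ranks < k).float().mean().item() * 100 for k in ks}

# Toy stand-in for the full test split, where all 20 captions of a video
# share the same ground-truth video (real scale: ~2,990 videos,
# ~59,800 captions).
num_videos, caps_per_video = 50, 20
sims = torch.randn(num_videos * caps_per_video, num_videos)
gt = torch.arange(num_videos).repeat_interleave(caps_per_video)
print(recall_at_k(sims, gt))
```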
Looking forward to your clarification. Thanks!
Hi! In the latest version, we follow Unmasked Teacher to conduct the evaluation. Please check the code and metadata~
For Q1, we use the 1k-A subset for testing. For Q2, only one caption is used for each video.
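For reference, under this 1k-A protocol the computation reduces to ranking a 1,000 × 1,000 caption-to-video similarity matrix whose ground truth is the diagonal. A minimal self-contained sketch, with random tensors standing in for the extracted features (not the repository's actual evaluation code):

```python
import torch
import torch.nn.functional as F

# Stand-ins for features extracted by the model: 1,000 videos, one caption each.
text_embeds = torch.randn(1000, 512)
video_embeds = torch.randn(1000, 512)

# Cosine similarity between every caption and every video.
sims = F.normalize(text_embeds, dim=-1) @ F.normalize(video_embeds, dim=-1).T

# Caption i's ground-truth video is index i, i.e. the diagonal of `sims`.
order = sims.argsort(dim=-1, descending=True)
ranks = (order == torch.arange(1000)[:, None]).nonzero()[:, 1]
for k in (1, 5, 10):
    print(f"R@{k}: {(ranks < k).float().mean().item() * 100:.1f}%")
```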
Thanks @Andy1621. I will look at the link you pointed to and get back to you if I have any doubts.