The generation of the sub-task 【fine-grained action】 in MVBench
Hello authors,
In your paper, you mention that the answer candidates for the questions in the sub-task 【fine-grained action】 are generated using UMT-L. Could you please clarify whether you use a pre-trained UMT-L model to encode the videos and the 339 category names (the total number of categories in the Moments in Time dataset), and then compute the text-visual similarity?
Thank you!
Yes, we use the UMT-L model to encode the video, and then select the top-10 most similar categories based on the prediction scores.
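For anyone reading along, here is a minimal sketch of what that selection step could look like. This is not the authors' actual code: `video_feat` and `text_feats` stand in for embeddings that would come from the UMT-L visual and text encoders (whose real API may differ), and the embedding dimension of 768 is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def select_topk_categories(video_feat: torch.Tensor,
                           text_feats: torch.Tensor,
                           k: int = 10) -> torch.Tensor:
    """Return indices of the k category names most similar to the video.

    video_feat: (D,) embedding of one video.
    text_feats: (C, D) embeddings of the C category names
                (C = 339 for Moments in Time).
    """
    # Normalize so the dot product becomes cosine similarity.
    video_feat = F.normalize(video_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    scores = text_feats @ video_feat   # (C,) text-visual similarity scores
    return scores.topk(k).indices      # top-k candidate categories

# Example with random placeholder embeddings (D = 768 assumed):
video_feat = torch.randn(768)
text_feats = torch.randn(339, 768)
candidate_ids = select_topk_categories(video_feat, text_feats)
print(candidate_ids)  # indices of the 10 candidate action categories
```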