The generation of the sub-task 【fine-grained action】 in MVBench
Hello authors,
In your paper, you mention that the answer candidates for the questions in the sub-task 【fine-grained action】 are generated using UMT-L. Could you please clarify whether you use a pre-trained UMT-L model to encode the videos and the 339 category names (the total number of categories in the Moments in Time dataset), and then compute the text-visual similarity?
Thank you!
Yes, we use the UMT-L model to encode the video, and then select the top-10 most similar categories based on the prediction scores.
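For anyone reading along, here is a minimal sketch of what that selection step could look like. This is not the authors' actual code: `video_feat` and `text_feats` stand in for embeddings that would come from the UMT-L visual and text encoders (whose real API may differ), and the embedding dimension of 768 is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def select_topk_categories(video_feat: torch.Tensor,
                           text_feats: torch.Tensor,
                           k: int = 10) -> torch.Tensor:
    """Return indices of the k category names most similar to the video.

    video_feat: (D,) embedding of one video.
    text_feats: (C, D) embeddings of the C category names
                (C = 339 for Moments in Time).
    """
    # Normalize so the dot product becomes cosine similarity.
    video_feat = F.normalize(video_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    scores = text_feats @ video_feat   # (C,) text-visual similarity scores
    return scores.topk(k).indices      # top-k candidate categories

# Example with random placeholder embeddings (D = 768 assumed):
video_feat = torch.randn(768)
text_feats = torch.randn(339, 768)
candidate_ids = select_topk_categories(video_feat, text_feats)
print(candidate_ids)  # indices of the 10 candidate action categories
```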