Wrong results on the action recognition task.
Hi! I have tried InternVideo2-1B-CLIP on the action recognition task on the K400 dataset, without using the dataset class you designed. On the vision side, I sample 8 frames from each video, transform them with test_transform, and feed the processed clip into the vision encoder to get a 1x768 feature. On the text side, I use the k400_categories.txt and kinetics_prompt you provide, and after the text encoder I get 400x16x768 features. Finally I pass these two features to get_sim and rank the categories, but the results are very bad: the correct answer is never in the top-5 choices, and the model seems to rank the categories randomly. I don't know what is wrong. The weights I use are chinese_alpaca_lora_7b, InternVideo2-stage2_1b-224p-f4.pt, internvl_c_13b_224px.pth, and InternVideo2_CLIP_1B.pth.
I also don't use flash-attn, DeepSpeed, fused_rmsnorm, or fused_mlp, but I don't think that should affect the inference result.
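For reference, this is roughly what my inference does (a minimal sketch only; `encode_vision`, `encode_text`, and the prompt formatting are simplified stand-ins for the actual calls I make, and I average the text features over the 16 prompts before computing the similarity):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_categories(clip, categories, prompts, model):
    # clip: frames already processed by test_transform, e.g. shape (1, 8, 3, 224, 224)
    # encode_vision / encode_text are placeholder names, not the repo's exact API
    v_feat = F.normalize(model.encode_vision(clip), dim=-1)            # (1, 768)

    texts = [p.format(c) for c in categories for p in prompts]         # 400 classes x 16 prompts
    t_feat = F.normalize(model.encode_text(texts), dim=-1)             # (400*16, 768)
    t_feat = t_feat.view(len(categories), len(prompts), -1).mean(1)    # average over the 16 prompts
    t_feat = F.normalize(t_feat, dim=-1)                               # (400, 768)

    sim = v_feat @ t_feat.T                                            # (1, 400)
    top5 = sim.topk(5, dim=-1).indices[0].tolist()
    return [categories[i] for i in top5]
```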
Hi! Can you try to reproduce the results on a small dataset like UCF101? That way you can check whether you have loaded the weights correctly.
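As a quick sanity check on the weight-loading side, a plain-PyTorch snippet like the one below makes it obvious when keys are silently dropped (the wrapping key "model" is just a guess; adjust it to whatever your checkpoint actually uses):

```python
import torch

def check_load(model, ckpt_path):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # checkpoints are often wrapped, e.g. under a "model" or "state_dict" key
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
    msg = model.load_state_dict(state_dict, strict=False)
    # long lists here mean the weights never actually made it into the model
    print(f"missing keys: {len(msg.missing_keys)}, e.g. {msg.missing_keys[:5]}")
    print(f"unexpected keys: {len(msg.unexpected_keys)}, e.g. {msg.unexpected_keys[:5]}")
```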
@zhengrongz Would you kindly provide a code snippet for loading the model from a .pth file? I am still struggling and cannot find proper documentation.
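For what it's worth, this is as far as I got on my own (just loading the .pth and printing the top-level keys to see how the state dict is wrapped); I don't know whether this is the intended entry point:

```python
import torch

ckpt = torch.load("InternVideo2_CLIP_1B.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    # shows whether the weights sit at the top level or nested under "model"/"state_dict"
    print(list(ckpt.keys())[:10])
```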
@zhengrongz, can you point to the inference script that you used?
^ Please provide a way to test the code and run inference; it's very hard to reproduce the results.