Wrong results on the action recognition task.
Hi! I have tried InternVideo2-1B-CLIP on the action recognition task on the K400 dataset, without using the dataset class you designed. On the vision side, I sample 8 frames from each video, transform them with test_transform, and feed the processed clip into the vision encoder to get a 1x768 feature. On the text side, I use the k400_categories.txt and kinetics_prompt you provide, and after the text encoder I get 400x16x768 features. Finally I pass these two features to get_sim and rank the categories, but the results are very bad: the correct answer is never in the top-5 choices, and the model seems to rank the categories randomly. I don't know what is wrong. The weights I use are chinese_alpaca_lora_7b, InternVideo2-stage2_1b-224p-f4.pt, internvl_c_13b_224px.pth, and InternVideo2_CLIP_1B.pth.
I also don't use flash-attn, DeepSpeed, fused_rmsnorm, or fused_mlp, but I don't think that should affect the inference result.
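For reference, this is roughly what my inference does (a minimal sketch only; `encode_vision`, `encode_text`, and the prompt formatting are simplified stand-ins for the actual calls I make, and I average the text features over the 16 prompts before computing the similarity):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_categories(clip, categories, prompts, model):
    # clip: frames already processed by test_transform, e.g. shape (1, 8, 3, 224, 224)
    # encode_vision / encode_text are placeholder names, not the repo's exact API
    v_feat = F.normalize(model.encode_vision(clip), dim=-1)            # (1, 768)

    texts = [p.format(c) for c in categories for p in prompts]         # 400 classes x 16 prompts
    t_feat = F.normalize(model.encode_text(texts), dim=-1)             # (400*16, 768)
    t_feat = t_feat.view(len(categories), len(prompts), -1).mean(1)    # average over the 16 prompts
    t_feat = F.normalize(t_feat, dim=-1)                               # (400, 768)

    sim = v_feat @ t_feat.T                                            # (1, 400)
    top5 = sim.topk(5, dim=-1).indices[0].tolist()
    return [categories[i] for i in top5]
```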
Hi! Can you try to reproduce the results on a small dataset like UCF101? That way you can check whether you have loaded the weights correctly.
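As a quick sanity check on the weight-loading side, a plain-PyTorch snippet like the one below makes it obvious when keys are silently dropped (the wrapping key "model" is just a guess; adjust it to whatever your checkpoint actually uses):

```python
import torch

def check_load(model, ckpt_path):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # checkpoints are often wrapped, e.g. under a "model" or "state_dict" key
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
    msg = model.load_state_dict(state_dict, strict=False)
    # long lists here mean the weights never actually made it into the model
    print(f"missing keys: {len(msg.missing_keys)}, e.g. {msg.missing_keys[:5]}")
    print(f"unexpected keys: {len(msg.unexpected_keys)}, e.g. {msg.unexpected_keys[:5]}")
```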
@zhengrongz Would you kindly provide a code snippet for loading the model from a .pth file? I am still struggling and cannot find proper documentation.
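For what it's worth, this is as far as I got on my own (just loading the .pth and printing the top-level keys to see how the state dict is wrapped); I don't know whether this is the intended entry point:

```python
import torch

ckpt = torch.load("InternVideo2_CLIP_1B.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    # shows whether the weights sit at the top level or nested under "model"/"state_dict"
    print(list(ckpt.keys())[:10])
```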
@zhengrongz, can you point to the inference script that you used?
^ Please provide a way to test the code and run inference; it's very hard to reproduce the results.