InternVideo

Wrong results in Action Recognition task.

Open zhengrongz opened this issue 1 year ago • 5 comments

Hi! I have tried Internvideo2-1B-clip on the action recognition task with the K400 dataset, without using the dataset class you designed. On the vision side, I sample 8 frames from the video, transform them with test_transform, and feed the processed clip into the vision encoder to get a 1x768 feature. On the text side, I use the k400_categories.txt and kinetics_prompt you provided; after the text encoder this gives 400x16x768 features. Finally, I pass these two features to get_sim and get a ranking of the categories, but the result is very bad: the correct answer is never in the top-5 choices, and the model seems to rank the categories randomly. I don't know if something is wrong. The models I use are chinese_alpaca_lora_7b, InternVideo2-stage2_1b-224p-f4.pt, internvl_c_13b_224px.pth, InternVideo2_CLIP_1B.pth.
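
For reference, the ranking step described above can be sketched as follows. This is only a minimal illustration of CLIP-style zero-shot classification with prompt ensembling, not the repository's get_sim. The post does not say whether the features are normalized or how the 16 prompt embeddings per class are pooled, so the L2 normalization and mean pooling here are assumptions, and the names l2_normalize and zero_shot_topk are hypothetical. Random arrays stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_topk(video_feat, text_feats, k=5):
    """Rank classes by cosine similarity to a single video feature.

    video_feat: shape (1, D) or (D,), one pooled video embedding.
    text_feats: shape (C, P, D), P prompt embeddings for each of C classes.
    Returns (indices, scores) of the k best classes, best first.
    Assumption: prompts are normalized, mean-pooled, then re-normalized.
    """
    v = l2_normalize(np.asarray(video_feat, dtype=np.float64).reshape(-1))
    t = l2_normalize(np.asarray(text_feats, dtype=np.float64))
    class_emb = l2_normalize(t.mean(axis=1))   # (C, D), one vector per class
    scores = class_emb @ v                     # (C,), cosine similarities
    topk = np.argsort(-scores)[:k]
    return topk, scores[topk]

# Synthetic stand-ins with the shapes from the post: 400 classes, 16 prompts, 768 dims.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(400, 16, 768))
# Build a fake video feature close to class 7 so the ranking can be sanity-checked.
video_feat = text_feats[7].mean(axis=0, keepdims=True) + 0.1 * rng.normal(size=(1, 768))

topk, scores = zero_shot_topk(video_feat, text_feats)
print(topk, scores)
```

Running a check like this on synthetic features separates two questions: whether the ranking logic is right, and whether the encoders are producing meaningful features. If the ranking logic passes but real features still rank randomly, the weights or the preprocessing are the more likely cause.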

zhengrongz avatar Jun 01 '24 10:06 zhengrongz

I also don't use flashattn, deepspeed, fused_rmsnorm, or fused_mlp, but I don't think that will affect the inference results.

zhengrongz avatar Jun 01 '24 10:06 zhengrongz

Hi! Can you try to reproduce the results on a small dataset like UCF101? That way you can check whether you have loaded the weights correctly.

Andy1621 avatar Jun 01 '24 16:06 Andy1621

@zhengrongz Would you kindly provide a code snippet for loading the model from a pth file? I am still struggling without one and cannot find proper documentation.

MH-Python avatar Sep 17 '24 09:09 MH-Python

@zhengrongz, can you point me to the inference script that you used?

stiwarifh avatar Nov 11 '24 11:11 stiwarifh

^ Please provide a way to test the code and run inference; it's very hard to reproduce the results.

sourenpash avatar Nov 29 '24 08:11 sourenpash