Evaluation of the finetuned model on the SthV2 dataset gives extremely low performance
Thank you for your great work!
I downloaded the finetuned model provided in your model zoo: https://huggingface.co/OpenGVLab/InternVideo2-Stage1-1B-224p-f8-SthSth/blob/main/1B_ft_ssv2_f8.pth (with 77.1% top-1 accuracy reported on SthV2) and prepared the SthV2 dataset according to your instructions (though the instructions may be a bit vague).
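As a side note, here is a minimal sketch of how I would sanity-check the extracted-frame layout (the prefix path and video id below are placeholders, and the per-video directory layout with 1-based frame numbering is only my assumption derived from `--filename_tmpl`):

```python
import os

# Sanity-check the extracted-frame layout. Assumptions (not confirmed by the
# repo docs): frames live under <prefix>/<video_id>/ and are named
# img_00001.jpg, img_00002.jpg, ... starting from 1, per --filename_tmpl.
prefix = "/path/to/ssv2_frames"  # placeholder for the --prefix value
video_id = "74225"               # placeholder id taken from the annotation list

frame_dir = os.path.join(prefix, video_id)
frames = sorted(f for f in os.listdir(frame_dir) if f.endswith(".jpg"))
print(f"{len(frames)} frames, first: {frames[0]}, last: {frames[-1]}")

# A mismatch here (0-based numbering, different zero padding, nested dirs)
# can make the loader read wrong or empty frames without crashing loudly.
assert frames[0] == "img_{:05}.jpg".format(1), "filename template mismatch"
```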
I then evaluated the model using most of the parameters from the script https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/single_modality/scripts/finetuning/full_tuning/ssv2/1B_ft_ssv2_f8.sh as follows:
```bash
python run_finetuning.py \
    --model internvideo2_1B_patch14_224 \
    --data_path [our data path] \
    --prefix [our data path] \
    --data_set SSV2 \
    --filename_tmpl img_{:05}.jpg \
    --no_use_decord \
    --nb_classes 174 \
    --finetune [our path]/OpenGVLab--InternVideo2-Stage1-1B-224p-f8-SthSth/1B_ft_ssv2_f8.pth \
    --log_dir [our path]/logs/1B_ft_ssv2_f8 \
    --output_dir [our path]/1B_ft_ssv2_f8 \
    --batch_size 8 \
    --num_sample 2 \
    --input_size 224 \
    --short_side_size 224 \
    --save_ckpt_freq 100 \
    --num_frames 8 \
    --num_workers 12 \
    --warmup_epochs 3 \
    --tubelet_size 1 \
    --epochs 8 \
    --lr 1e-4 \
    --drop_path 0.3 \
    --layer_decay 0.915 \
    --use_checkpoint \
    --checkpoint_num 6 \
    --layer_scale_init_value 1e-5 \
    --opt adamw \
    --opt_betas 0.9 0.999 \
    --weight_decay 0.05 \
    --test_num_segment 2 \
    --test_num_crop 3 \
    --dist_eval \
    --enable_deepspeed \
    --bf16 \
    --zero_stage 1 \
    --test_best \
    --eval
```
With either raw frames or raw videos as input, we got extremely low evaluation results (0.59% top-1 and 2.80% top-5 accuracy with raw frames as input).
Would you kindly help check what the reason might be? Is it a problem with the dataset preparation or with the parameter configuration?
Thank you very much for your time.
The top-1 accuracy of 0.59% looks like random guessing (1/174 ≈ 0.57%).
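Since chance-level accuracy often points at a label-mapping problem, one check worth running on the annotation list is sketched below (assuming a space-separated format with the label in the last column, which may differ from the format the data loader actually expects):

```python
from collections import Counter

# Check that labels in the validation list are 0-based and cover 0..173,
# matching --nb_classes 174. Assumption: one sample per line, label in the
# last whitespace-separated column (adjust the parsing to your list format).
anno_file = "/path/to/ssv2_val.csv"  # placeholder for the validation list

labels = []
with open(anno_file) as f:
    for line in f:
        labels.append(int(line.strip().split()[-1]))

counts = Counter(labels)
print(f"{len(labels)} samples, {len(counts)} distinct labels, "
      f"min={min(labels)}, max={max(labels)}")

# 1-based labels, or a class-name-to-id mapping built in a different order
# than the one used at training time, both produce near-random accuracy.
assert min(labels) >= 0 and max(labels) < 174, "label range mismatch"
```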
I also evaluated the SSV1 model (https://huggingface.co/OpenGVLab/InternVideo2-Stage1-1B-224p-f8-SthSth/blob/main/1B_ft_ssv1_f8.pth) on the SSV1 dataset; the top-1 and top-5 accuracies were 0.50% and 2.42%, also at chance level.
I have also checked the loaded model weights against the weights used in the evaluation forward pass (fc_norm & head), and they match.
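For completeness, here is a simplified version of that check that only inspects the checkpoint itself (assuming the state dict is stored at the top level or under a 'module'/'model' key, which varies between plain torch.save and DeepSpeed checkpoints):

```python
import torch

# Inspect the classifier weights stored in the finetuned checkpoint.
# Assumption: the state dict sits at the top level or under a
# 'module'/'model' key; adjust if your checkpoint is structured differently.
ckpt = torch.load("1B_ft_ssv2_f8.pth", map_location="cpu")
state = ckpt.get("module", ckpt.get("model", ckpt)) if isinstance(ckpt, dict) else ckpt

for name, tensor in state.items():
    if "head" in name or "fc_norm" in name:
        print(name, tuple(tensor.shape))

# head.weight should have shape (174, embed_dim); anything else means the
# loaded checkpoint does not match --nb_classes 174.
```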