Evaluation of the finetuned model on the SthV2 dataset gives extremely low performance
Thank you for your great work!
I downloaded the finetuned model provided in your model zoo: https://huggingface.co/OpenGVLab/InternVideo2-Stage1-1B-224p-f8-SthSth/blob/main/1B_ft_ssv2_f8.pth (with 77.1% top-1 accuracy reported on SthV2) and prepared the SthV2 dataset according to your instructions (though the instructions may be a bit vague).
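As a side note, here is a minimal sketch of how I would sanity-check the extracted-frame layout (the prefix path and video id below are placeholders, and the per-video directory layout with 1-based frame numbering is only my assumption derived from `--filename_tmpl`):

```python
import os

# Sanity-check the extracted-frame layout. Assumptions (not confirmed by the
# repo docs): frames live under <prefix>/<video_id>/ and are named
# img_00001.jpg, img_00002.jpg, ... starting from 1, per --filename_tmpl.
prefix = "/path/to/ssv2_frames"  # placeholder for the --prefix value
video_id = "74225"               # placeholder id taken from the annotation list

frame_dir = os.path.join(prefix, video_id)
frames = sorted(f for f in os.listdir(frame_dir) if f.endswith(".jpg"))
print(f"{len(frames)} frames, first: {frames[0]}, last: {frames[-1]}")

# A mismatch here (0-based numbering, different zero padding, nested dirs)
# can make the loader read wrong or empty frames without crashing loudly.
assert frames[0] == "img_{:05}.jpg".format(1), "filename template mismatch"
```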
I then evaluated the model using most of the parameters from the script https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/single_modality/scripts/finetuning/full_tuning/ssv2/1B_ft_ssv2_f8.sh as follows:
```bash
python run_finetuning.py \
    --model internvideo2_1B_patch14_224 \
    --data_path [our data path] \
    --prefix [our data path] \
    --data_set SSV2 \
    --filename_tmpl img_{:05}.jpg \
    --no_use_decord \
    --nb_classes 174 \
    --finetune [our path]/OpenGVLab--InternVideo2-Stage1-1B-224p-f8-SthSth/1B_ft_ssv2_f8.pth \
    --log_dir [our path]/logs/1B_ft_ssv2_f8 \
    --output_dir [our path]/1B_ft_ssv2_f8 \
    --batch_size 8 \
    --num_sample 2 \
    --input_size 224 \
    --short_side_size 224 \
    --save_ckpt_freq 100 \
    --num_frames 8 \
    --num_workers 12 \
    --warmup_epochs 3 \
    --tubelet_size 1 \
    --epochs 8 \
    --lr 1e-4 \
    --drop_path 0.3 \
    --layer_decay 0.915 \
    --use_checkpoint \
    --checkpoint_num 6 \
    --layer_scale_init_value 1e-5 \
    --opt adamw \
    --opt_betas 0.9 0.999 \
    --weight_decay 0.05 \
    --test_num_segment 2 \
    --test_num_crop 3 \
    --dist_eval \
    --enable_deepspeed \
    --bf16 \
    --zero_stage 1 \
    --test_best \
    --eval
```
With either raw frames or raw videos as input, we got extremely low evaluation results (0.59% top-1 and 2.80% top-5 accuracy with raw frames as input).
Would you kindly help check what the reason might be? Is it a problem with the dataset preparation or with the parameter configuration?
Thank you very much for your time.
The top-1 accuracy of 0.59% looks like random guessing (1/174 ≈ 0.57%).
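Since chance-level accuracy often points at a label-mapping problem, one check worth running on the annotation list is sketched below (assuming a space-separated format with the label in the last column, which may differ from the format the data loader actually expects):

```python
from collections import Counter

# Check that labels in the validation list are 0-based and cover 0..173,
# matching --nb_classes 174. Assumption: one sample per line, label in the
# last whitespace-separated column (adjust the parsing to your list format).
anno_file = "/path/to/ssv2_val.csv"  # placeholder for the validation list

labels = []
with open(anno_file) as f:
    for line in f:
        labels.append(int(line.strip().split()[-1]))

counts = Counter(labels)
print(f"{len(labels)} samples, {len(counts)} distinct labels, "
      f"min={min(labels)}, max={max(labels)}")

# 1-based labels, or a class-name-to-id mapping built in a different order
# than the one used at training time, both produce near-random accuracy.
assert min(labels) >= 0 and max(labels) < 174, "label range mismatch"
```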
I also evaluated the SSV1 model (https://huggingface.co/OpenGVLab/InternVideo2-Stage1-1B-224p-f8-SthSth/blob/main/1B_ft_ssv1_f8.pth) on the SSV1 dataset; the top-1 and top-5 accuracies were 0.50% and 2.42%, also at chance level.
I have also checked the loaded model weights against the weights used in the evaluation forward pass (fc_norm & head), and they match.
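For completeness, here is a simplified version of that check that only inspects the checkpoint itself (assuming the state dict is stored at the top level or under a 'module'/'model' key, which varies between plain torch.save and DeepSpeed checkpoints):

```python
import torch

# Inspect the classifier weights stored in the finetuned checkpoint.
# Assumption: the state dict sits at the top level or under a
# 'module'/'model' key; adjust if your checkpoint is structured differently.
ckpt = torch.load("1B_ft_ssv2_f8.pth", map_location="cpu")
state = ckpt.get("module", ckpt.get("model", ckpt)) if isinstance(ckpt, dict) else ckpt

for name, tensor in state.items():
    if "head" in name or "fc_norm" in name:
        print(name, tuple(tensor.shape))

# head.weight should have shape (174, embed_dim); anything else means the
# loaded checkpoint does not match --nb_classes 174.
```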