
Question regarding stage 4 HD image size

jpan72 opened this issue · 2 comments

Hello,

Thank you for the great work!

For stage 4 (instruction tuning with HD data), the current code seems to resize/crop images to 224x224: https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/scripts/videochat_mistral/config_7b_hd_stage4.py#L21 https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/dataset/__init__.py#L73

which means it's actually using 224x224 frames for training. Is that true? If so, what is this "HD" about? Or did I miss something?

Thank you!

jpan72 — Oct 17 '24

224 is the input resolution of our vision encoder. You can refer to the dynamic-resolution setting for HD: https://github.com/OpenGVLab/Ask-Anything/blob/c3f07988b1db77ed24d706650d3cb23e3495a011/video_chat2/scripts/videochat_mistral/config_7b_hd_stage4.py#L85-L90
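For readers unfamiliar with the scheme: dynamic resolution picks a tile grid whose aspect ratio best matches the frame, resizes the frame to that grid, and feeds each 224x224 tile to the encoder separately. A minimal sketch of the grid-selection step, assuming an aspect-ratio-matching heuristic (function names here are illustrative, not the repo's actual API):

```python
def pick_grid(width, height, max_tiles=6):
    """Pick a (cols, rows) tile grid whose aspect ratio best matches
    the input image, using at most max_tiles tiles of 224x224 each.
    Illustrative sketch only -- not the repo's actual implementation."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(target - cols / rows)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best

# The frame is then resized to (cols * 224) x (rows * 224) and split,
# so every tile is exactly the encoder's 224x224 input.
```

For example, `pick_grid(448, 224)` returns `(2, 1)`, so a 2:1 frame becomes two 224x224 tiles rather than one downsampled 224x224 image.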

yinanhe — Oct 17 '24

Thank you for the swift response! I see how the dynamic resolution setting works for HD training.

One follow-up question: I saw that `blocks` is not used in videochat2 HD training. https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/dataset/hd_utils.py#L93

However, in the InternVL code that the videochat2 code refers to, `blocks` is used to generate `local_size x local_size` sub-images: https://github.com/OpenGVLab/InternVL/blob/2d93b099ffbbf45d1db59710914f26fce4494104/README.md?plain=1#L752-L771
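For context, the sub-image generation in the linked InternVL code can be sketched roughly like this, assuming the image has already been resized to an exact multiple of the 224 tile size (names and signatures are illustrative, not the actual InternVL/videochat2 code):

```python
def tile_boxes(cols, rows, tile=224):
    """Return crop boxes (left, top, right, bottom) that cut a
    (cols*tile) x (rows*tile) image into cols*rows sub-images,
    each exactly tile x tile -- i.e., the encoder's input size.
    Sketch only; the real implementations differ in detail."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]

# Each box can be passed to something like PIL's Image.crop() to
# produce one 224x224 sub-image for the vision encoder.
```

So for a 2x2 grid, `tile_boxes(2, 2)` yields four boxes covering a 448x448 image, each matching the encoder's 224x224 input.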

Does that mean videochat2 HD training doesn't use the sub-images, and uses the resized images instead? In that case, how does that work with the vision encoder's 224x224 input setup?

Thank you!

jpan72 — Nov 07 '24