Does InternVideo2-1B-s2 use AudioEncoder?
When I read many articles about VFM, I often find that methods incorporating the audio modality tend to perform better than those using only video and text. Could you please tell me if the audio modality was also incorporated in the training of the InternVideo2-1B-s2 model?
No. We only introduce audio for InternVideo2-6B-s2, but you can try post-pretraining InternVideo2-1B-s2 with audio-vision-text data in the style of VALOR or VAST.
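For reference, here is a minimal sketch of what such VALOR/VAST-style audio-vision-text post-pretraining could look like. The function names, embedding dimensions, and the symmetric InfoNCE objective are assumptions for illustration, not the InternVideo2 training code.

```python
# Hypothetical sketch of an audio-vision-text contrastive post-pretraining step.
# Module names and shapes are placeholders, not from the InternVideo2 codebase.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def post_pretrain_step(video_emb, audio_emb, text_emb):
    # Align audio and text with the video representation, as in
    # VALOR/VAST-style multi-modal contrastive training.
    loss_vt = info_nce(video_emb, text_emb)
    loss_va = info_nce(video_emb, audio_emb)
    loss_at = info_nce(audio_emb, text_emb)
    return loss_vt + loss_va + loss_at

# Random features standing in for encoder outputs:
v = torch.randn(8, 768)   # video encoder output (e.g. InternVideo2-1B-s2)
a = torch.randn(8, 768)   # audio encoder output (e.g. a separate audio backbone)
t = torch.randn(8, 768)   # text encoder output
print(post_pretrain_step(v, a, t))
```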
Thank you for your response. I have a few additional questions:
- I have also reviewed the UMT paper. In other issues, you mentioned that the training parameters for InternVideo2-1B-s2 were set similarly to those of UMT. The Stage 1 and Stage 2 training procedures of UMT and InternVideo2 appear nearly identical, differing mainly in model structure and the teacher employed, yet the zero-shot capability seems to have increased by almost 10 points. Could you confirm whether InternVideo2's Stage 2 uses the InternVL-6B CLIP teacher, and explain why InternVideo2's improvement is so significant?
- We are attempting to reproduce the results of InternVideo2-stage2_1b-224p-f4.pt using your publicly released checkpoint InternVideo2-Stage1-1B-224p-f8. We followed the parameter settings in the open-source config, but obtained a zero-shot T2V R@1 of 42.9 on MSRVTT (about 9 points lower). Could you please review our configuration and identify any possible discrepancies? Here is our config: config.txt
I would greatly appreciate it if you could respond.
- The improvement comes from several factors, including the larger scale of the teacher model and the higher quality of the training data. In my opinion, the most important factor is the higher quality of the training data.
- Did you use our code and JSON files to test the performance? The configuration file seems fine. Also, you should use https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4
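As a sanity check on the evaluation side, below is a minimal sketch of how zero-shot T2V R@1 can be computed from paired text/video embeddings. The embedding dimension and the random features are placeholders, not the actual InternVideo2 evaluation pipeline; in practice the embeddings would come from the InternVideo2-Stage2_1B-224p-f4 vision and text encoders.

```python
# Hypothetical T2V R@1 computation for an MSRVTT-style retrieval setup.
import torch
import torch.nn.functional as F

def t2v_recall_at_k(text_emb, video_emb, k=1):
    """R@k where text_emb[i] is the caption paired with video_emb[i]."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    sims = text_emb @ video_emb.t()                       # (num_texts, num_videos)
    topk = sims.topk(k, dim=-1).indices                   # top-k retrieved videos per text
    targets = torch.arange(text_emb.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# 1000 paired text/video embeddings (MSRVTT 1k test split size), random here:
t = torch.randn(1000, 512)
v = torch.randn(1000, 512)
print(f"T2V R@1: {t2v_recall_at_k(t, v, k=1) * 100:.1f}")
```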