InternVideo
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Thank you for your work! I have a question about the zero-shot video retrieval task on the ActivityNet dataset: which pretrained model should I use to reproduce the reported performance? Is it CLIP ViT-L-14.pt?...
Hello, I am unable to run the spatiotemporal action localization. It would be good to know how to run the shared module for spatiotemporal action recognition. Best
Hello, could you please release the checkpoint of the ViT-H model? Thanks.
Hi, since the 200M pretraining dataset is much bigger than the 10M version, why is the zero-shot performance not superior to that of the 10M one?
Hi, the code says that video training needs find_unused_parameters=True, but for image training it can be set to False. I wonder why? Thank you.
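For context, a minimal sketch of where this flag enters, assuming a standard torch DistributedDataParallel wrapper (`build_model` and `wrap_model` are hypothetical names, not the repo's code): if some parameters of the video model receive no gradient on a given step (e.g. a branch that is only active for certain inputs), DDP must search for unused parameters or the gradient synchronization will error out; an image model that always uses every parameter can keep the flag off and avoid the extra overhead.

```python
# Minimal sketch, not the repo's actual launcher: how find_unused_parameters
# is typically passed to DistributedDataParallel.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, rank: int, is_video: bool) -> DDP:
    # Video training: some parameters may not get gradients every step,
    # so DDP must detect unused parameters to keep gradient sync consistent.
    # Image training: all parameters are used, so False avoids the extra
    # graph traversal cost.
    return DDP(
        model.to(rank),
        device_ids=[rank],
        find_unused_parameters=is_video,
    )
```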
I want to run the finetuned InternVideo-MM-L-14 | ActivityNet model. I have my own custom videos. Do you have a simple demo script to run this model? (Similar to...
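Pending an official demo, here is a hypothetical sketch of running zero-shot video-text retrieval on custom videos; the model-loading and encoding calls (`encode_video`, `encode_text`) are placeholders, not the actual InternVideo API, and only the frame sampling and cosine-similarity ranking are standard.

```python
# Hypothetical demo sketch: sample frames from a custom video, encode video and
# captions with a CLIP-style model, and rank captions by cosine similarity.
import cv2
import numpy as np
import torch

def sample_frames(path: str, num_frames: int = 8) -> np.ndarray:
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # (T, H, W, 3)

@torch.no_grad()
def rank_texts(model, video_path: str, texts: list[str]):
    frames = sample_frames(video_path)
    vid_emb = model.encode_video(frames)   # placeholder: model-specific call
    txt_emb = model.encode_text(texts)     # placeholder: model-specific call
    vid_emb = vid_emb / vid_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (vid_emb @ txt_emb.T).squeeze(0)  # cosine similarity per caption
    return sorted(zip(texts, sims.tolist()), key=lambda x: -x[1])
```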
In https://github.com/OpenGVLab/InternVideo/blob/6264cc85f72e38dce7e38549182d0369c50cde00/Data/InternVid/viclip/viclip.py#L140-L143 a single video is converted to a batch of 8 images (instead of a batch of one video). This bug is transferred to the demo in https://github.com/OpenGVLab/InternVideo/blob/6264cc85f72e38dce7e38549182d0369c50cde00/Data/InternVid/viclip/__init__.py#L71 where...
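An illustrative sketch of the shape issue being reported (the frame count and tensor shapes below are assumptions for illustration, not the repo's exact preprocessing): the sampled frames of one video should be stacked into a single clip with an explicit batch dimension, not passed as a batch of independent images.

```python
# Shape illustration: 8 frames of one video should form a (1, T, C, H, W) clip,
# not an (8, C, H, W) batch of 8 separate images.
import torch

num_frames, C, H, W = 8, 3, 224, 224
frames = torch.rand(num_frames, C, H, W)  # 8 preprocessed frames of one video

as_image_batch = frames                   # (8, 3, 224, 224): treated as 8 images
as_video_batch = frames.unsqueeze(0)      # (1, 8, 3, 224, 224): one video clip

print(as_image_batch.shape, as_video_batch.shape)
```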
Thank you for the nice work. Regarding training ViCLIP, I would like to clarify my understanding of the paper. If the vision transformer is not pre-trained, such as with the MAE method, then it...
Hello, I am glad that you open-sourced the checkpoints and the demo script recently. When I ran the provided demo script, I found that `viclip.py` attempts to import...
Hi authors! I'm trying to reproduce InternVideo+ActionFormer for temporal action localization. I just wanted to know your timeline for releasing the UniformerV2 features. Thank you.