Performance of the InternVideo2-Stage2-6B model from Hugging Face
Hello,
First of all, I sincerely apologize for posting this as a duplicate issue on both GitHub and Hugging Face. I found the model at this link on Hugging Face and ran the provided code.
I tested demo.py and noticed some unusual results.
Initially, I ran the code using the provided sample video file and text descriptions:
text_candidates = [
    "A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon.",
    "A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys.",
    "A person dressed in a blue jacket shovels the snow-covered pavement outside their house.",
    "A cat excitedly runs through the yard, chasing a rabbit.",
    "A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery."
]
The output was:
text: A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon. ~ prob: 0.5354
text: A cat excitedly runs through the yard, chasing a rabbit. ~ prob: 0.2978
text: A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys. ~ prob: 0.0989
text: A person dressed in a blue jacket shovels the snow-covered pavement outside their house. ~ prob: 0.0630
text: A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery. ~ prob: 0.0048
This result seems reasonable. However, when I tested with the following paraphrased descriptions, the results were not as expected:
paraphrased_text_candidates = [
"A cheerful dog and its owner tumble and chase each other in the snow-covered yard, full of excitement.",
"A man wearing a gray coat strides through the snowy terrain, dragging a sleigh stacked with toys.",
"Wearing a blue jacket, a person clears the snow from their driveway with a shovel.",
"A cat dashes energetically across the yard, pursuing a rabbit.",
"Wrapped in a warm blanket, a person strolls through the snowy landscape, admiring the peaceful winter atmosphere."
]
The output was:
text: A man wearing a gray coat strides through the snowy terrain, dragging a sleigh stacked with toys. ~ prob: 0.7446
text: A cat dashes energetically across the yard, pursuing a rabbit. ~ prob: 0.1992
text: A cheerful dog and its owner tumble and chase each other in the snow-covered yard, full of excitement. ~ prob: 0.0257
text: Wearing a blue jacket, a person clears the snow from their driveway with a shovel. ~ prob: 0.0200
text: Wrapped in a warm blanket, a person strolls through the snowy landscape, admiring the peaceful winter atmosphere. ~ prob: 0.0105
I am also curious why the retrieval code (predict_label) scales the video-text similarity by 100 before the softmax in the following function:
def predict_label(self,
                  vid_feat: torch.Tensor,
                  txt_feat: torch.Tensor,
                  top: int = 5):
    label_probs = (100.0 * vid_feat @ txt_feat.T).softmax(dim=-1)
    top_probs, top_labels = label_probs.float().cpu().topk(top, dim=-1)
    return top_probs, top_labels
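For what it's worth, the 100.0 looks like the standard CLIP-style logit scale (an inverse temperature): assuming vid_feat and txt_feat are L2-normalized, the cosine similarities lie in [-1, 1] and tend to be close together, so the scale only sharpens the softmax probabilities and does not change the ranking. A toy sketch of the effect (the numbers are made up, not from the model):

import torch

# Toy cosine similarities between one video and five captions.
sims = torch.tensor([0.31, 0.29, 0.24, 0.22, 0.18])

print(sims.softmax(dim=-1))             # nearly uniform: raw cosines are too close together
print((100.0 * sims).softmax(dim=-1))   # sharpened: same ranking, much more confident probabilities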
There is a bug that we are fixing; in the meantime, you could try OpenGVLab/InternVideo2-Stage2_6B-224p-f4.
Thank you for reviewing my issue! I’ll try the new one.
Have you solved this problem? I also encountered it.
Have you tried OpenGVLab/InternVideo2-Stage2_6B-224p-f4?
@lovepan1 I tested with the new model, and it actually worked well.
For the Stage 2 model, you can try this one: https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4/tree/main (although it seems this link is now deprecated, so we might need to wait until they re-upload it).
You’ll also need the configuration file for the 6B model, which is located here: https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/scripts/pretraining/stage2/6B/config.py
If you’re planning to use the CLIP model, you can reuse the same LLaMA parameters as the 1B model. Additional parameters were previously available here: https://huggingface.co/OpenGVLab/InternVideo2-CLIP-6B-224p-f8, but that link also seems to be deprecated, so we might need to wait until they re-upload it.
Also, I interpolated the positional encoding to test with 8 video frames. You can find the relevant code here: https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/models/backbones/internvideo2/pos_embed.py
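In case it helps, the idea behind that interpolation (independent of the exact helpers in pos_embed.py, which handle InternVideo2's real layout with class token and spatial grid) is just to resize the temporal axis of the positional embedding from the pretraining frame count to the new one. A minimal sketch, assuming a simplified (1, T, N, C) layout:

import torch
import torch.nn.functional as F

def interpolate_temporal_pos_embed(pos_embed: torch.Tensor, new_t: int) -> torch.Tensor:
    # pos_embed: (1, T, N, C) -> (1, new_t, N, C), linear interpolation along time.
    _, t, n, c = pos_embed.shape
    x = pos_embed.permute(0, 2, 3, 1).reshape(1, n * c, t)       # (1, N*C, T)
    x = F.interpolate(x, size=new_t, mode="linear", align_corners=False)
    return x.reshape(1, n, c, new_t).permute(0, 3, 1, 2)         # (1, new_t, N, C)

# Toy shapes only: go from a 4-frame embedding to 8 frames.
pos_4f = torch.randn(1, 4, 256, 1024)
print(interpolate_temporal_pos_embed(pos_4f, new_t=8).shape)     # torch.Size([1, 8, 256, 1024])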
The links open fine for me; please check your network.
This problem occurred when using the stage2_6B config with internvideo2-s2_6b-224p-f4.pt.
@lovepan1 I think you’ll need to modify the model configuration, especially the parameter shapes, so that the pretrained weights load correctly. I actually refactored the model loading process myself instead of relying on the original code; that was the best way I could get it to work and understand things in detail.
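A quick, generic way to see which shapes disagree before touching the config is to diff the checkpoint against the model (this is just a diagnostic sketch, not the refactored loading code referred to above):

import torch

def report_shape_mismatches(model: torch.nn.Module, ckpt_path: str) -> None:
    # Purely diagnostic: print checkpoint entries whose shapes differ from the model's.
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("module", state.get("model", state))  # unwrap common wrappers
    own = model.state_dict()
    for name, tensor in state.items():
        if name not in own:
            print(f"unexpected key: {name}")
        elif own[name].shape != tensor.shape:
            print(f"shape mismatch: {name} ckpt={tuple(tensor.shape)} model={tuple(own[name].shape)}")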
@leexinhao The 6B and 1B models' text encoders have different shapes.
OK, thank you. I will first try modifying the text configuration to see if it works. @newcommandd
[Screenshots of my result, my code, and my bert-large-uncased/config.json]
Edit d_model=768 in stage2_config.py, or add "encoder_width": 768 to bert-large-uncased/config.json. @newcommandd @leexinhao
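For anyone following along, a minimal sketch of the second option (the key name and value come from the comment above; the path is whatever your local copy of the BERT config uses):

import json

cfg_path = "bert-large-uncased/config.json"  # path to your local copy
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["encoder_width"] = 768  # value suggested above

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)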
I have fixed this problem and added new code for InternVideo2-6B.