
Performance of the InternVideo2-Stage2-6B model from Hugging Face

Open · joooooonyoung opened this issue 11 months ago · 12 comments

Hello,

First of all, I apologize for filing this as a duplicate issue on both GitHub and Hugging Face. I found the model at this link on Hugging Face and ran the provided code.

I tested demo.py and noticed some unusual results.

Initially, I ran the code using the provided sample video file and text descriptions:

text_candidates = [
    "A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon.",
    "A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys.",
    "A person dressed in a blue jacket shovels the snow-covered pavement outside their house.",
    "A cat excitedly runs through the yard, chasing a rabbit.",
    "A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery.",
]

The output was:

text: A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon. ~ prob: 0.5354
text: A cat excitedly runs through the yard, chasing a rabbit. ~ prob: 0.2978
text: A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys. ~ prob: 0.0989
text: A person dressed in a blue jacket shovels the snow-covered pavement outside their house. ~ prob: 0.0630
text: A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery. ~ prob: 0.0048

This result seems reasonable. However, when I tested with the following paraphrased descriptions, the results were not as expected:

paraphrased_text_candidates = [
    "A cheerful dog and its owner tumble and chase each other in the snow-covered yard, full of excitement.",
    "A man wearing a gray coat strides through the snowy terrain, dragging a sleigh stacked with toys.",
    "Wearing a blue jacket, a person clears the snow from their driveway with a shovel.",
    "A cat dashes energetically across the yard, pursuing a rabbit.",
    "Wrapped in a warm blanket, a person strolls through the snowy landscape, admiring the peaceful winter atmosphere.",
]

The output was:

text: A man wearing a gray coat strides through the snowy terrain, dragging a sleigh stacked with toys. ~ prob: 0.7446
text: A cat dashes energetically across the yard, pursuing a rabbit. ~ prob: 0.1992
text: A cheerful dog and its owner tumble and chase each other in the snow-covered yard, full of excitement. ~ prob: 0.0257
text: Wearing a blue jacket, a person clears the snow from their driveway with a shovel. ~ prob: 0.0200
text: Wrapped in a warm blanket, a person strolls through the snowy landscape, admiring the peaceful winter atmosphere. ~ prob: 0.0105

I am also curious why the retrieval code (predict_label) multiplies the video-text similarity scores by 100 in the following function:

def predict_label(self,
                  vid_feat: torch.Tensor,
                  txt_feat: torch.Tensor,
                  top: int = 5):
    label_probs = (100.0 * vid_feat @ txt_feat.T).softmax(dim=-1)
    top_probs, top_labels = label_probs.float().cpu().topk(top, dim=-1)
    return top_probs, top_labels
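
For context, the 100.0 appears to act as a fixed logit scale (an inverse softmax temperature), as in CLIP: with L2-normalized features, the video-text cosine similarities are all confined to [-1, 1], so a softmax over the raw scores would be nearly uniform. A minimal sketch with made-up similarity values (not from the repo):

import torch

# Hypothetical cosine similarities between one video and five captions;
# with unit-norm features they are confined to [-1, 1].
sims = torch.tensor([0.31, 0.28, 0.22, 0.19, 0.15])

print(sims.softmax(dim=-1))            # nearly uniform, roughly 0.18-0.22 each
print((100.0 * sims).softmax(dim=-1))  # sharply peaked on the top match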

joooooonyoung commented on Feb 19, 2025

There is a bug that we are fixing; in the meantime, you could try OpenGVLab/InternVideo2-Stage2_6B-224p-f4.

leexinhao commented on Feb 26, 2025

Thank you for reviewing my issue! I'll try the new one.

joooooonyoung commented on Feb 26, 2025

Have you solved this problem? I've run into it as well.

lovepan1 commented on Apr 9, 2025

Have you tried OpenGVLab/InternVideo2-Stage2_6B-224p-f4?

leexinhao commented on Apr 13, 2025

@lovepan1 I tested the new model and it actually worked well.

For the Stage 2 model, you can try this one: https://huggingface.co/OpenGVLab/InternVideo2-Stage2_6B-224p-f4/tree/main. However, it seems this link is now deprecated, so we might need to wait until they re-upload it.

You’ll also need the configuration file for the 6B model, which is located here: https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/scripts/pretraining/stage2/6B/config.py

If you're planning to use the CLIP model, you can reuse the same LLaMA parameters as the 1B model. Additional parameters were previously available here: https://huggingface.co/OpenGVLab/InternVideo2-CLIP-6B-224p-f8. However, it seems this link is also now deprecated, so we might need to wait until they re-upload it.

Also, I interpolated the positional encodings to test with 8 video frames. You can find the relevant code here: https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/models/backbones/internvideo2/pos_embed.py
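
The idea, roughly, is to resample the per-frame positional embeddings along the time axis. A minimal sketch of just that temporal step (my own illustration, not the repo's helper; pos_embed.py handles the full spatio-temporal case):

import torch
import torch.nn.functional as F

def interpolate_temporal_pos_embed(pos_embed: torch.Tensor, new_frames: int) -> torch.Tensor:
    # pos_embed: (T_old, C) per-frame positional embedding; returns (new_frames, C)
    pos = pos_embed.t().unsqueeze(0)  # (1, C, T_old), the layout F.interpolate expects
    pos = F.interpolate(pos, size=new_frames, mode="linear", align_corners=False)
    return pos.squeeze(0).t()

# e.g. stretch embeddings from a 4-frame checkpoint to 8 frames
pos8 = interpolate_temporal_pos_embed(torch.randn(4, 768), 8)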

joooooonyoung commented on Apr 14, 2025

The links open fine for me; please check your network.

leexinhao commented on Apr 14, 2025

This problem occurred when using the stage2_6B config with internvideo2-s2_6b-224p-f4.pt (screenshot attached).

lovepan1 commented on Apr 15, 2025

@lovepan1 I think you'll need to modify the model configuration, especially the parameter shapes, so that the pretrained weights load correctly. I actually refactored the model-loading process myself instead of relying on the original code; that was the best way I could get it working and understand the details.
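
Roughly, the loading I mean looks like this (a hypothetical sketch, not my exact code; the "model" wrapper key and the filtering policy are assumptions that vary between checkpoints):

import torch

def load_checkpoint_report(model: torch.nn.Module, ckpt_path: str):
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)  # unwrap a common wrapper key, if present
    own = model.state_dict()

    # report shape mismatches instead of crashing on them
    for k, v in state.items():
        if k in own and v.shape != own[k].shape:
            print(f"shape mismatch: {k}: checkpoint {tuple(v.shape)} vs model {tuple(own[k].shape)}")

    # drop mismatched tensors so load_state_dict does not raise
    filtered = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    missing, unexpected = model.load_state_dict(filtered, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")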

joooooonyoung commented on Apr 15, 2025

@leexinhao The text encoder shapes of the 6B and 1B models are different (screenshot attached).

lovepan1 commented on Apr 15, 2025

OK, thank you. I'll first try modifying the text configuration to see if that works. @newcommandd

lovepan1 commented on Apr 15, 2025

This is my result: (screenshot attached)

This is the code: (screenshot attached)

This is my bert-large-uncased/config.json: (screenshot attached)

The fix: edit d_model=768 in stage2_config.py, or add "encoder_width": 768 to bert-large-uncased/config.json. @newcommandd @leexinhao
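
In case it helps, that config.json change can also be scripted; a small sketch (adjust the local path to your setup):

import json
from pathlib import Path

cfg_path = Path("bert-large-uncased/config.json")  # your local text-encoder config
cfg = json.loads(cfg_path.read_text())
cfg["encoder_width"] = 768  # the fix described above
cfg_path.write_text(json.dumps(cfg, indent=2))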

lovepan1 commented on Apr 15, 2025

I have fixed this problem and added new code for InternVideo2-6B.

leexinhao commented on Aug 3, 2025