InternVideo
InternVideo2-Chat 8B Visual Encoder and Text Encoder
Dear Team, thank you for the great work. I am currently exploring InternVideo2-Chat 8B and have a few questions about it.
- What visual encoder is used? Is it the InternVideo2 s2-1B or the InternVideo2-CLIP-1B?
- I am interested in using the text encoder that was jointly trained with this vision encoder in stage 2. In the code you only load the vision encoder, so I was wondering whether the aligned text encoder can be accessed through the Hugging Face API, or whether I have to write a separate function such as build_text_encoder, analogous to build_vision_encoder in the modelling_base.py file.
Hopefully this makes sense; otherwise, please ask.
- InternVideo2 s2-1B
- If you only use the stage 2 model, you need to reorganize the corresponding code following our codebase rather than using the HF API directly. After loading the stage 2 checkpoint, it should not take many lines of code; a rough sketch follows below.
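A minimal sketch of that reorganization, assuming the stage 2 text encoder is the BERT-Large-style encoder from the InternVideo2 multi_modality codebase. The checkpoint filename, the `text_encoder.` key prefix, and the `bert-large-uncased` config are assumptions to verify against the actual stage 2 code and checkpoint:

```python
# Hedged sketch: extract the jointly trained text encoder from a full
# stage 2 checkpoint instead of going through the HF chat wrapper.
import torch
from transformers import BertConfig, BertModel

# Checkpoint path and dict layout are assumptions; adjust to your download.
ckpt = torch.load("InternVideo2-stage2_1b-224p-f4.pt", map_location="cpu")
state_dict = ckpt.get("module", ckpt.get("model", ckpt))

# Keep only the text-encoder weights and strip their prefix.
text_sd = {
    k[len("text_encoder."):]: v
    for k, v in state_dict.items()
    if k.startswith("text_encoder.")
}

# Stage 2 reportedly uses a BERT-Large-style text encoder; verify the config.
text_encoder = BertModel(BertConfig.from_pretrained("bert-large-uncased"))
missing, unexpected = text_encoder.load_state_dict(text_sd, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```

With `strict=False`, any fusion or cross-attention layers the stage 2 text encoder adds on top of vanilla BERT show up in the printed key counts instead of raising an error, which makes it easy to see how far the assumed config is from the real one.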
Hey, thank you for the quick response. I was comparing the code/checkpoint of the visual encoder used in InternVideo2 stage 2 with the one used in InternVideo2-Chat 8B. The two visual encoders have slightly different architectures, and hence their outputs differ. Can you kindly confirm what exactly was taken? The two module printouts are below.
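For reference, the Chat-side printout below can be reproduced roughly like this (a hedged sketch; the repo id is the public Hugging Face one, but the `vision_encoder` attribute name is an assumption to check against the repo's modeling code):

```python
# Hedged sketch: load the chat model and print its vision tower.
import torch
from transformers import AutoModel

chat = AutoModel.from_pretrained(
    "OpenGVLab/InternVideo2-Chat-8B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
print(chat.vision_encoder)  # expected: PretrainVisionTransformer_clean(...)
```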
**InternVideo2-stage2-1B visual encoder**
```
PretrainInternVideo2(
  (patch_embed): PatchEmbed(
    (proj): Conv3d(3, 1408, kernel_size=(1, 14, 14), stride=(1, 14, 14))
    (norm): Identity()
  )
  (blocks): ModuleList(
    (0): Block(
      (norm1): RMSNorm()
      (attn): Attention(
        (qkv): Linear(in_features=1408, out_features=4224, bias=False)
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=1408, out_features=1408, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (q_norm): RMSNorm()
        (k_norm): RMSNorm()
      )
      (ls1): LayerScale()
      (drop_path1): Identity()
      (norm2): RMSNorm()
      (mlp): Mlp(
        (fc1): Linear(in_features=1408, out_features=6144, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (fc2): Linear(in_features=6144, out_features=1408, bias=True)
        (drop2): Dropout(p=0.0, inplace=False)
      )
      (ls2): LayerScale()
      (drop_path2): Identity()
    )
    (1-39): 39 x Block(
      (norm1): RMSNorm()
      (attn): Attention(
        (qkv): Linear(in_features=1408, out_features=4224, bias=False)
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=1408, out_features=1408, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (q_norm): RMSNorm()
        (k_norm): RMSNorm()
      )
      (ls1): LayerScale()
      (drop_path1): DropPath()
      (norm2): RMSNorm()
      (mlp): Mlp(
        (fc1): Linear(in_features=1408, out_features=6144, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (fc2): Linear(in_features=6144, out_features=1408, bias=True)
        (drop2): Dropout(p=0.0, inplace=False)
      )
      (ls2): LayerScale()
      (drop_path2): DropPath()
    )
  )
  (clip_projector): AttentionPoolingBlock(
    (norm1_q): LayerNorm((1408,), eps=1e-05, elementwise_affine=True)
    (norm1_k): LayerNorm((1408,), eps=1e-05, elementwise_affine=True)
    (norm1_v): LayerNorm((1408,), eps=1e-05, elementwise_affine=True)
    (cross_attn): CrossAttention(
      (q): Linear(in_features=1408, out_features=1408, bias=False)
      (k): Linear(in_features=1408, out_features=1408, bias=False)
      (v): Linear(in_features=1408, out_features=1408, bias=False)
      (attn_drop): Dropout(p=0.0, inplace=False)
      (proj): Linear(in_features=1408, out_features=768, bias=True)
      (proj_drop): Dropout(p=0.0, inplace=False)
    )
    (drop_path): Identity()
  )
  (clip_decoder): ModuleList(
    (0-5): 6 x Linear_Decoder(
      (head): Linear(in_features=1408, out_features=3200, bias=True)
      (norm): LayerNorm((3200,), eps=1e-05, elementwise_affine=True)
    )
  )
  (final_clip_decoder): Linear_Decoder(
    (head): Linear(in_features=768, out_features=768, bias=True)
    (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
)
```
**InternVideo2_VideoChat2 visual encoder**
```
PretrainVisionTransformer_clean(
  (patch_embed): PatchEmbed(
    (proj): Conv3d(3, 1408, kernel_size=(1, 14, 14), stride=(1, 14, 14))
    (norm): Identity()
  )
  (blocks): ModuleList(
    (0): Block(
      (norm1): RMSNorm()
      (attn): Attention(
        (qkv): Linear(in_features=1408, out_features=4224, bias=False)
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=1408, out_features=1408, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (q_norm): RMSNorm()
        (k_norm): RMSNorm()
      )
      (drop_path1): Identity()
      (norm2): RMSNorm()
      (mlp): Mlp(
        (fc1): Linear(in_features=1408, out_features=6144, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (fc2): Linear(in_features=6144, out_features=1408, bias=True)
        (drop2): Dropout(p=0.0, inplace=False)
      )
      (drop_path2): Identity()
    )
    (1-38): 38 x Block(
      (norm1): RMSNorm()
      (attn): Attention(
        (qkv): Linear(in_features=1408, out_features=4224, bias=False)
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=1408, out_features=1408, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
        (q_norm): RMSNorm()
        (k_norm): RMSNorm()
      )
      (drop_path1): DropPath()
      (norm2): RMSNorm()
      (mlp): Mlp(
        (fc1): Linear(in_features=1408, out_features=6144, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (fc2): Linear(in_features=6144, out_features=1408, bias=True)
        (drop2): Dropout(p=0.0, inplace=False)
      )
      (drop_path2): DropPath()
    )
  )
)
(vision_layernorm): LayerNorm((1408,), eps=1e-12, elementwise_affine=True)
```
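Summarizing the visible differences between the two dumps: the stage 2 encoder has 40 blocks with LayerScale (`ls1`/`ls2`) and carries the `clip_projector`, `clip_decoder`, and `final_clip_decoder` heads, whereas the Chat 8B encoder has 39 blocks, no LayerScale, none of the CLIP heads, and a separate `vision_layernorm` (eps=1e-12). A quick way to enumerate such divergences programmatically, assuming `stage2_enc` and `chat_enc` already hold the two modules:

```python
# Hedged sketch: diff the module trees of the two encoders to list exactly
# where they diverge (assumes `stage2_enc` and `chat_enc` are built/loaded).
stage2_names = {name for name, _ in stage2_enc.named_modules()}
chat_names = {name for name, _ in chat_enc.named_modules()}

print("only in stage 2:", sorted(stage2_names - chat_names)[:10])
print("only in chat:   ", sorted(chat_names - stage2_names)[:10])
```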