
What is the proper way to preprocess image inputs for InternVideo2-Chat?

Open chancharikmitra opened this issue 1 year ago • 7 comments

Hi, thanks for your fantastic video foundation model! I was interested in exploring the capabilities of InternVideo2-Chat for both images and video. According to the Huggingface code, the model can take both image and video input (from modeling_videochat2.py):

[screenshot of the relevant code from modeling_videochat2.py]

However, I don't see any preprocessing or inference instructions for images. Could you please share those? Or is the intent simply to treat an image as a single-frame video?
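
To make the question concrete, here is roughly what I mean by treating an image as a single-frame video, reusing the transform from the video demo (just a sketch on my end with an illustrative image path, not code from the repo):

from PIL import Image
from torchvision import transforms
from torchvision.transforms import PILToTensor

mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

transform = transforms.Compose([
    transforms.Lambda(lambda x: x.float().div(255.0)),
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.Normalize(mean, std)
])

img = Image.open('example.jpg').convert('RGB')   # illustrative path
frames = PILToTensor()(img).unsqueeze(0)         # (1, C, H, W), as if a one-frame video
frames = transform(frames)
print(frames.shape)                              # torch.Size([1, 3, 224, 224])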

chancharikmitra, Sep 01 '24

They are here: https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD/blob/main/modeling_videochat2.py#L82-L90

yinanhe, Sep 03 '24

Thanks @yinanhe for the note! While there does appear to be internal processing of images, the demo provides quite a bit of preprocessing code for videos (https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD). I'm just wondering whether there are similar preprocessing steps for images.

chancharikmitra, Sep 03 '24

@chancharikmitra Here is the example:

# imports needed for this snippet (HD_transform_padding / HD_transform_no_padding
# are the helpers defined in the HF repo's modeling code)
from PIL import Image
from torchvision import transforms
from torchvision.transforms import PILToTensor

img = Image.open(img_path).convert('RGB')

resolution = 224
hd_num = 12
padding = False

mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

transform = transforms.Compose([
    transforms.Lambda(lambda x: x.float().div(255.0)),
    transforms.Normalize(mean, std)
])

img = PILToTensor()(img).unsqueeze(0)  # (1, C, H, W): the image as a single "frame"

if padding:
    img = HD_transform_padding(img.float(), image_size=resolution, hd_num=hd_num)
else:
    img = HD_transform_no_padding(img.float(), image_size=resolution, hd_num=hd_num)

img = transform(img).unsqueeze(0).cuda()  # add a leading batch dimension: (1, 1, C, H', W')
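
If it helps, the same steps can be wrapped into a load_image helper that mirrors load_video from the demo (just a sketch; load_image is not a function in the repo, and HD_transform_padding / HD_transform_no_padding are the helpers from the HF modeling code):

from PIL import Image
from torchvision import transforms
from torchvision.transforms import PILToTensor

def load_image(img_path, resolution=224, hd_num=12, padding=False):
    # sketch only: repeats the steps shown above in one reusable function
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Normalize(mean, std)
    ])

    img = Image.open(img_path).convert('RGB')
    img = PILToTensor()(img).unsqueeze(0)  # (1, C, H, W): the image as a single "frame"

    if padding:
        img = HD_transform_padding(img.float(), image_size=resolution, hd_num=hd_num)
    else:
        img = HD_transform_no_padding(img.float(), image_size=resolution, hd_num=hd_num)

    return transform(img).unsqueeze(0).cuda()  # (1, 1, C, H', W')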

yinanhe, Sep 04 '24

Perfect, thank you @yinanhe! In that case, the image preprocessing for the non-HD chat model would be something like this?

from PIL import Image
from torchvision import transforms
from torchvision.transforms import PILToTensor

img = Image.open(img_path).convert('RGB')

mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

transform = transforms.Compose([
    transforms.Lambda(lambda x: x.float().div(255.0)),
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.Normalize(mean, std)
])
img = PILToTensor()(img).unsqueeze(0)

img = transform(img).unsqueeze(0).cuda()

chancharikmitra, Sep 04 '24

@chancharikmitra In the non-HD case, we do not use center crop; it's like this:

# imports (get_sinusoid_encoding_table is defined in the HF repo's modeling code,
# and `model` is the already-loaded InternVideo2-Chat model)
import matplotlib.pyplot as plt
from PIL import Image
from torchvision import transforms
from torchvision.transforms import InterpolationMode

img = Image.open(img_path).convert('RGB')

plt.imshow(img)

resolution = 224
# resolution = 384

# rebuild the single-frame positional-embedding table for the chosen resolution
new_pos_emb = get_sinusoid_encoding_table(n_position=(resolution//16)**2, cur_frame=1, ckpt_num_frame=1, pre_n_position=14*14)
model.vision_encoder.encoder.img_pos_embed = new_pos_emb

transform = transforms.Compose(
    [
        transforms.Resize(
            (resolution, resolution), interpolation=InterpolationMode.BICUBIC
        ),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    ]
)

img = transform(img).unsqueeze(0).unsqueeze(0).cuda()  # (1, 1, 3, resolution, resolution)
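
With resolution = 224 this gives a single-frame batch; you can sanity-check the shape like this:

print(img.shape)  # torch.Size([1, 1, 3, 224, 224])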

yinanhe, Sep 05 '24

Sorry for the delayed response, and thank you for the clarification. I will follow your advice. However, I have a couple of questions. First, this seems very different from how video frames are processed by InternVideo2; the following load_video is what is included in the HuggingFace demo:

def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)

    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(224),
        transforms.Normalize(mean, std)
    ])

    frames = vr.get_batch(frame_indices)
    frames = frames.permute(0, 3, 1, 2)
    frames = transform(frames)

    T_, C, H, W = frames.shape
        
    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        # " " should be added in the start and end
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    else:
        return frames

Yet you are saying that images are not center-cropped. Am I to understand that InternVideo2 handles images differently from individual video frames? That conflicts with my intuition, so I would just like to confirm that this is intended. Second, I notice that a positional embedding is already set up in the model code, so I'm not sure why get_sinusoid_encoding_table needs to be called separately (and implemented on our side) before passing in images.

It would be great if image and video inputs could be handled a bit more uniformly at inference time (a rough sketch of what I mean is below). Also, if you wouldn't mind confirming or clarifying the points above, that would be great. I know image inference is somewhat out of distribution for InternVideo2, but I just want to make sure I am matching what was intended as closely as possible.
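
For instance, something like this unified loader is roughly what I have in mind (just a sketch stitched together from the snippets in this thread, applying the HD-style image path to both media types; load_media is a hypothetical name, and get_index / HD_transform_padding / HD_transform_no_padding are the helpers from the HF demo):

from decord import VideoReader, cpu
from PIL import Image
from torchvision import transforms
from torchvision.transforms import PILToTensor

def load_media(path, media_type='video', num_segments=8, resolution=224, hd_num=4, padding=False):
    # sketch only: one preprocessing path for images and video frames
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Normalize(mean, std)
    ])

    if media_type == 'image':
        frames = PILToTensor()(Image.open(path).convert('RGB')).unsqueeze(0)  # (1, C, H, W)
    else:
        vr = VideoReader(path, ctx=cpu(0), num_threads=1)
        frame_indices = get_index(len(vr), num_segments)
        # as in the demo's load_video (assumes decord's torch bridge)
        frames = vr.get_batch(frame_indices).permute(0, 3, 1, 2)              # (T, C, H, W)

    if padding:
        frames = HD_transform_padding(frames.float(), image_size=resolution, hd_num=hd_num)
    else:
        frames = HD_transform_no_padding(frames.float(), image_size=resolution, hd_num=hd_num)

    return transform(frames).unsqueeze(0).cuda()  # (1, T, C, H', W')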

chancharikmitra, Sep 14 '24

Sorry, closed the issue by accident.

chancharikmitra, Sep 14 '24