What is the proper way to preprocess image inputs for InternVideo2-Chat?
Hi, thanks for your fantastic video foundation model! I am interested in exploring the capabilities of InternVideo2-Chat for both images and video. According to the HuggingFace code, the model can take both image and video input (see modeling_videochat2.py).
However, I don't see any preprocessing or inference instructions for images. Could you please share what those are? Or is the assumption simply that an image is treated as a single-frame video?
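For example, my naive guess was just to wrap the image as a one-frame clip and feed it through the existing video pipeline, something like this (purely a guess on my part):
from PIL import Image
from torchvision.transforms import PILToTensor

# hypothetical: treat the image as a (T=1, C, H, W) "video" for the existing video preprocessing
frame = PILToTensor()(Image.open(img_path).convert('RGB')).unsqueeze(0)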
They are here: https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD/blob/main/modeling_videochat2.py#L82-L90
Thanks @yinanhe for the note! While there does appear to be some internal processing of images, the demo (https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD) provides a lot of external preprocessing for videos, so I'm just wondering whether there are similar preprocessing steps for images.
@chancharikmitra Here is the example:
from PIL import Image
from torchvision import transforms
from torchvision.transforms import PILToTensor
# HD_transform_padding / HD_transform_no_padding are the helper functions from the HD demo code

img = Image.open(img_path).convert('RGB')

resolution = 224
hd_num = 12
padding = False
mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)
transform = transforms.Compose([
    transforms.Lambda(lambda x: x.float().div(255.0)),
    transforms.Normalize(mean, std)
])

img = PILToTensor()(img).unsqueeze(0)  # (1, C, H, W), uint8
if padding:
    img = HD_transform_padding(img.float(), image_size=resolution, hd_num=hd_num)
else:
    img = HD_transform_no_padding(img.float(), image_size=resolution, hd_num=hd_num)
img = transform(img).unsqueeze(0).cuda()  # add batch dimension and move to GPU
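After that, the tensor can be passed to the model in the same way as a video tensor. For example, something along these lines - the chat arguments here just mirror the video demo on the model card, and media_type='image' plus the exact tensor shape are things you may need to adjust:
# sketch only: follows the video demo's chat call; adjust arguments/shapes as needed
chat_history = []
response, chat_history = model.chat(
    tokenizer, '', 'Describe the image in detail.',
    media_type='image', media_tensor=img,
    chat_history=chat_history, return_history=True,
    generation_config={'do_sample': False}
)
print(response)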
Perfect, thank you @yinanhe! In that case, the image preprocessing for the non-HD chat model would be something like this?
from PIL import Image
from torchvision import transforms
from torchvision.transforms import PILToTensor

img = Image.open(img_path).convert('RGB')

mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)
transform = transforms.Compose([
    transforms.Lambda(lambda x: x.float().div(255.0)),
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.Normalize(mean, std)
])

img = PILToTensor()(img).unsqueeze(0)
img = transform(img).unsqueeze(0).cuda()
@chancharikmitra In the non-HD case we do not use a center crop; it's more like this:
from PIL import Image
import matplotlib.pyplot as plt
from torchvision import transforms
from torchvision.transforms import InterpolationMode
# get_sinusoid_encoding_table is the positional-embedding helper from the model code;
# `model` is assumed to be loaded already

img = Image.open(img_path).convert('RGB')
plt.imshow(img)

resolution = 224
# resolution = 384

# rebuild the positional-embedding table for one frame at the chosen resolution
new_pos_emb = get_sinusoid_encoding_table(n_position=(resolution//16)**2, cur_frame=1, ckpt_num_frame=1, pre_n_position=14*14)
model.vision_encoder.encoder.img_pos_embed = new_pos_emb

transform = transforms.Compose(
    [
        transforms.Resize(
            (resolution, resolution), interpolation=InterpolationMode.BICUBIC
        ),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
    ]
)
img = transform(img).unsqueeze(0).unsqueeze(0).cuda()  # (1, 1, C, H, W): batch and single-frame dims
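Note: the get_sinusoid_encoding_table call above just rebuilds the positional-embedding table for a single frame at the chosen resolution. The pretrained table covers a 14x14 patch grid (pre_n_position=14*14), so with a patch size of 16 the grid sizes work out to:
# patch-grid sizes implied by the call above (patch size 16)
for resolution in (224, 384):
    n_position = (resolution // 16) ** 2
    print(resolution, n_position)  # 224 -> 196 (14x14), 384 -> 576 (24x24)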
Sorry for the delayed response, and thank you for the clarification; I will follow your advice. However, I have a couple of questions. This seems quite different from how the frames of a video are processed by InternVideo2. The load_video function included on HuggingFace is:
from decord import VideoReader, cpu
from torchvision import transforms

def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)  # frame-sampling helper from the demo

    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(224),
        transforms.Normalize(mean, std)
    ])

    frames = vr.get_batch(frame_indices)
    # (T, H, W, C) -> (T, C, H, W); assumes decord is set to return torch tensors
    # (decord.bridge.set_bridge('torch'))
    frames = frames.permute(0, 3, 1, 2)
    frames = transform(frames)
    T_, C, H, W = frames.shape

    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        # " " should be added in the start and end
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    else:
        return frames
But you are saying that images are not center-cropped. So am I to understand that InternVideo2 handles images differently from individual video frames? That seems counterintuitive, so I just want to confirm that this is intended. The other thing I noticed is that the positional embedding is already implemented in the model code, so I'm not sure why get_sinusoid_encoding_table needs to be called separately before passing the image - especially since it's a helper we have to implement ourselves.
If you wouldn't mind confirming or clarifying the points above, that would be great. I know image inference is somewhat out of distribution for InternVideo2, but I just want to make sure I am matching what was intended as closely as possible. It would also be helpful if the image and video inputs could be unified a bit more for inference.
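For instance, a small load_image helper that mirrors load_video (just my own sketch reusing the non-HD recipe you suggested above, not something from the repo) would make the two paths symmetric:
from PIL import Image
from torchvision import transforms
from torchvision.transforms import InterpolationMode

def load_image(img_path, resolution=224):
    # hypothetical helper mirroring load_video: returns a (T=1, C, H, W) tensor
    # note: the positional-embedding adjustment from the snippet above would still be needed separately
    transform = transforms.Compose([
        transforms.Resize((resolution, resolution), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                             (0.26862954, 0.26130258, 0.27577711)),
    ])
    img = Image.open(img_path).convert('RGB')
    return transform(img).unsqueeze(0)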
Sorry, closed the issue by accident.