Wrong index of width and height in the `unpad_image` function
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/8c09e0206af91c5c83fc25f55500883e9df261b8/llava/model/llava_arch.py#L137
I have reviewed the code, and the code is correct (PIL's `image.size` is `(width, height)`), but your comment is wrong...
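To make the ordering concrete, here is a minimal check (not from the repo) contrasting PIL's `(width, height)` convention with torch's `(C, H, W)` layout:

```python
from PIL import Image
import torch

# PIL reports size as (width, height)...
img = Image.new("RGB", (640, 480))  # width=640, height=480
assert img.size == (640, 480)

# ...while a torch tensor of the same image is laid out (channels, height, width),
# so code that unpacks original_size as (width, height) and tensor.shape[1:] as
# (height, width) is consistent, even though the two orders look "swapped".
t = torch.zeros(3, 480, 640)
assert t.shape[1] == img.size[1]  # height
assert t.shape[2] == img.size[0]  # width
```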
Hi @starhiking, may I kindly ask you a question about the `original_size` here? Thanks in advance :)
Here is my config:
```python
parser.add_argument("--mm_patch_merge_type", type=str, default="spatial_unpad")
```
then it will run:

```python
elif "unpad" in mm_patch_merge_type:
    image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
    image_feature = image_feature.flatten(1, 2).flatten(2, 3)
    image_feature = unpad_image(image_feature, image_sizes[image_idx])
    image_feature = torch.cat(
        (
            image_feature,
            self.model.image_newline[:, None, None]
            .expand(*image_feature.shape[:-1], 1)
            .to(image_feature.device),
        ),
        dim=-1,
    )
    image_feature = image_feature.flatten(1, 2).transpose(0, 1)
```
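For reference, here is a paraphrased sketch of what `unpad_image` does (see the linked `llava_arch.py` for the exact implementation); the key point is that `original_size` is unpacked as `(width, height)` while the feature tensor is `(C, H, W)`:

```python
import torch

# Paraphrased sketch of the unpadding logic, not the repo's verbatim code:
# crop the padded dimension of a (C, H, W) feature map back to the original
# aspect ratio, where original_size follows PIL's (width, height) order.
def unpad_image_sketch(tensor: torch.Tensor, original_size: tuple) -> torch.Tensor:
    original_width, original_height = original_size   # PIL-style (width, height)
    current_height, current_width = tensor.shape[1:]  # tensor-style (height, width)

    if original_width / original_height > current_width / current_height:
        # Image was padded top/bottom: crop rows back to the scaled height.
        scale = current_width / original_width
        new_height = int(original_height * scale)
        pad = (current_height - new_height) // 2
        return tensor[:, pad : current_height - pad, :]
    else:
        # Image was padded left/right: crop columns back to the scaled width.
        scale = current_height / original_height
        new_width = int(original_width * scale)
        pad = (current_width - new_width) // 2
        return tensor[:, :, pad : current_width - pad]
```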
but initially `image_sizes` is `None` when `prepare_inputs_labels_for_multimodal` is called, so it fails.
Then I passed this parameter as follows:

```python
output_ids = model.generate(
    image_sizes=[(video[0].shape[2], video[0].shape[3]) for _ in range(video[0].shape[0])],
    inputs=input_ids,
    images=video,
    attention_mask=attention_masks,
    modalities="video",
    do_sample=True,
    temperature=0.2,
    max_new_tokens=1024,
    use_cache=True,
    stopping_criteria=[stopping_criteria],
)
```
so `image_sizes` should be one `(width, height)` tuple per frame, e.g. `[(336, 336), (336, 336)]`, where 336 is the height after CLIP preprocessing, but it still fails.
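For what it's worth, in the image pipeline `image_sizes` is usually built from the PIL images before preprocessing, so it is already in `(width, height)` order; a preprocessed tensor, by contrast, carries dimensions in `(..., H, W)` order, so the indices would need to be swapped for non-square frames. A hypothetical sketch (`pil_frames` and `video` are placeholder names):

```python
import torch
from PIL import Image

# Hypothetical sketch: frames loaded as PIL images before preprocessing.
# PIL's .size is already (width, height), matching unpad_image's unpacking.
pil_frames = [Image.new("RGB", (640, 480)) for _ in range(2)]
image_sizes = [frame.size for frame in pil_frames]  # [(640, 480), (640, 480)]

# If only a preprocessed tensor of shape (num_frames, C, H, W) is available,
# shape[2] is height and shape[3] is width, so swap to keep (width, height):
video = torch.zeros(2, 3, 336, 336)
image_sizes = [(video.shape[3], video.shape[2]) for _ in range(video.shape[0])]
```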
May I know how you pass the `image_sizes` parameter here? Thanks so much.
I ran `scripts/train/finetune_clip.sh` and it works fine. The following screenshots may solve your problem.