Wrong index of width and height in the `unpad_image` function
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/8c09e0206af91c5c83fc25f55500883e9df261b8/llava/model/llava_arch.py#L137
I have reviewed the code, and the code is correct (PIL's `image.size` is `(width, height)`), but your comment is wrong...
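To make the ordering concrete, here is a minimal check (not from the repo) contrasting PIL's `(width, height)` convention with torch's `(C, H, W)` layout:

```python
from PIL import Image
import torch

# PIL reports size as (width, height)...
img = Image.new("RGB", (640, 480))  # width=640, height=480
assert img.size == (640, 480)

# ...while a torch tensor of the same image is laid out (channels, height, width),
# so code that unpacks original_size as (width, height) and tensor.shape[1:] as
# (height, width) is consistent, even though the two orders look "swapped".
t = torch.zeros(3, 480, 640)
assert t.shape[1] == img.size[1]  # height
assert t.shape[2] == img.size[0]  # width
```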
Hi @starhiking, may I kindly ask you a question about the `original_size` here? Thanks in advance :)
Here is my config:
```python
parser.add_argument("--mm_patch_merge_type", type=str, default="spatial_unpad")
```
then it will run:

```python
elif "unpad" in mm_patch_merge_type:
    image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
    image_feature = image_feature.flatten(1, 2).flatten(2, 3)
    image_feature = unpad_image(image_feature, image_sizes[image_idx])
    image_feature = torch.cat(
        (
            image_feature,
            self.model.image_newline[:, None, None]
            .expand(*image_feature.shape[:-1], 1)
            .to(image_feature.device),
        ),
        dim=-1,
    )
    image_feature = image_feature.flatten(1, 2).transpose(0, 1)
```
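For reference, here is a paraphrased sketch of what `unpad_image` does (see the linked `llava_arch.py` for the exact implementation); the key point is that `original_size` is unpacked as `(width, height)` while the feature tensor is `(C, H, W)`:

```python
import torch

# Paraphrased sketch of the unpadding logic, not the repo's verbatim code:
# crop the padded dimension of a (C, H, W) feature map back to the original
# aspect ratio, where original_size follows PIL's (width, height) order.
def unpad_image_sketch(tensor: torch.Tensor, original_size: tuple) -> torch.Tensor:
    original_width, original_height = original_size   # PIL-style (width, height)
    current_height, current_width = tensor.shape[1:]  # tensor-style (height, width)

    if original_width / original_height > current_width / current_height:
        # Image was padded top/bottom: crop rows back to the scaled height.
        scale = current_width / original_width
        new_height = int(original_height * scale)
        pad = (current_height - new_height) // 2
        return tensor[:, pad : current_height - pad, :]
    else:
        # Image was padded left/right: crop columns back to the scaled width.
        scale = current_height / original_height
        new_width = int(original_width * scale)
        pad = (current_width - new_width) // 2
        return tensor[:, :, pad : current_width - pad]
```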
but initially `image_sizes` is `None` when `prepare_inputs_labels_for_multimodal` is called, so it fails.
Then I passed this parameter as follows:

```python
output_ids = model.generate(
    image_sizes=[(video[0].shape[2], video[0].shape[3]) for _ in range(video[0].shape[0])],
    inputs=input_ids,
    images=video,
    attention_mask=attention_masks,
    modalities="video",
    do_sample=True,
    temperature=0.2,
    max_new_tokens=1024,
    use_cache=True,
    stopping_criteria=[stopping_criteria],
)
```
so `image_sizes` should be one `(width, height)` tuple per frame, e.g. `[(336, 336), (336, 336)]`, where 336 is the height after CLIP preprocessing, but it still fails.
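For what it's worth, in the image pipeline `image_sizes` is usually built from the PIL images before preprocessing, so it is already in `(width, height)` order; a preprocessed tensor, by contrast, carries dimensions in `(..., H, W)` order, so the indices would need to be swapped for non-square frames. A hypothetical sketch (`pil_frames` and `video` are placeholder names):

```python
import torch
from PIL import Image

# Hypothetical sketch: frames loaded as PIL images before preprocessing.
# PIL's .size is already (width, height), matching unpad_image's unpacking.
pil_frames = [Image.new("RGB", (640, 480)) for _ in range(2)]
image_sizes = [frame.size for frame in pil_frames]  # [(640, 480), (640, 480)]

# If only a preprocessed tensor of shape (num_frames, C, H, W) is available,
# shape[2] is height and shape[3] is width, so swap to keep (width, height):
video = torch.zeros(2, 3, 336, 336)
image_sizes = [(video.shape[3], video.shape[2]) for _ in range(video.shape[0])]
```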
May I know how you pass the `image_sizes` parameter here? Thanks so much.
I ran `scripts/train/finetune_clip.sh` and it works fine. The following screenshots may solve your problem.