InternVL different performance between online web demo and local model

Local chat code:

    import torch
    from PIL import Image
    from transformers import AutoModel, CLIPImageProcessor
    from transformers import AutoTokenizer

    path = "OpenGVLab/InternVL-Chat-Chinese-V1-2"
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        device_map='auto').eval()

    tokenizer = AutoTokenizer.from_pretrained(path)
    image = Image.open('./examples/4.jpg').convert('RGB')
    image = image.resize((448, 448))
    image_processor = CLIPImageProcessor.from_pretrained(path)

    pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
    pixel_values = pixel_values.to(torch.bfloat16).cuda()

    generation_config = dict(
        num_beams=1,
        max_new_tokens=10000,
        do_sample=False,
    )

    question = "You are an excellent image describer and questioner. You have three tasks in total: 1.Your first task is to describe the given image as detailed as possible. 2.Your second task is to ask a complex question that requires close inspection of the image and strong reasoning ability to answer, you should ask FIVE candidate questions in different aspects and diverse ways, then RANDOMLY choose one of them to answer.3.Your third task is to answer the question you raised solely based on the given image. When you ask questions, try to find the most valuable information in the picture to ask about, and ask a question that is relevant to that information. When you ask questions, do not involve violence, advertisement, possible invasion of privacy, or questions that may cause discomfort. Do not mention anything from the prompt in your response. You will follow the instructions to the best of your ability."
    response = model.chat(tokenizer, pixel_values, question, generation_config)

The local response is very simple and covers only the description task: In the image, there is a highway with multiple lanes. The highway is surrounded by trees and hills. There are several cars on the road, including a white car, a black car, and a silver car. The sky is blue and there are white clouds in the sky.

However, the online web demo works well with the same prompt:

Is there a wrong setting in my local configuration?

Apr 11 '24 07:04 cyj95

Is there any solution? I have encountered the same problem.

Apr 15 '24 02:04 xjixzz

It is indeed a bit strange; the model deployed in the demo and the open-source model have the same weights. I'll check for the reason.

Apr 16 '24 15:04 czczup

I suspect a problem with `device_map='auto'`. Are you running this model distributed across multiple GPUs?

Apr 16 '24 15:04 czczup
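One way to test the `device_map='auto'` hypothesis is to look at where the layers were placed. The following is a minimal sketch, assuming the model was loaded through transformers with a `device_map`, in which case it normally exposes an `hf_device_map` dict. The helper name `devices_used` and the module names in the example map are hypothetical and only illustrate the shape of the data.

```python
def devices_used(device_map):
    """Return the sorted list of distinct devices in a module-to-device map."""
    return sorted({str(device) for device in device_map.values()})


# Hypothetical example map; with a real model, pass model.hf_device_map instead.
example_map = {
    "vision_model": 0,
    "mlp1": 0,
    "language_model.model.layers.0": 1,
}

devices = devices_used(example_map)
print(devices)           # the distinct devices holding parts of the model
print(len(devices) > 1)  # True means the model is split across devices

# With the model from the original post (not run here):
# device_map = getattr(model, "hf_device_map", None)
# if device_map is not None:
#     print(devices_used(device_map))
```

To rule out sharding entirely, one option is to drop `device_map='auto'` from `from_pretrained` and call `.eval().cuda()` on the result, assuming the model fits on a single GPU.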

> I suspect a problem with `device_map='auto'`. Are you running this model distributed across multiple GPUs?

Yes.

Apr 17 '24 01:04 cyj95

> I suspect a problem with `device_map='auto'`. Are you running this model distributed across multiple GPUs?

I used a single GPU but encountered the same problem. I wonder whether it has something to do with `generation_config`. Could you please share the `generation_config` used by the online web demo?

Apr 17 '24 02:04 xjixzz

The default `generation_config` used in the online web demo is:

    generation_config = dict(
        num_beams=1,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.2,
        top_p=0.7
    )

Apr 17 '24 12:04 czczup
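To make the differences between the two setups explicit, the sketch below compares the `generation_config` from the original post with the demo defaults quoted above. The helper `diff_configs` is a hypothetical name introduced for illustration. Whether these settings account for the quality gap is not established in this thread, so treat it as one thing to try.

```python
# Settings from the original post: greedy decoding with a very large token budget.
local_config = dict(num_beams=1, max_new_tokens=10000, do_sample=False)

# Settings reported for the online web demo: low-temperature nucleus sampling.
demo_config = dict(
    num_beams=1,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.2,
    top_p=0.7,
)


def diff_configs(local, demo):
    """Map each differing key to a (local, demo) pair; None marks a missing key."""
    keys = sorted(set(local) | set(demo))
    return {k: (local.get(k), demo.get(k)) for k in keys if local.get(k) != demo.get(k)}


for key, (local_value, demo_value) in diff_configs(local_config, demo_config).items():
    print(f"{key}: local={local_value} demo={demo_value}")

# To try the demo settings locally, reusing the names from the original post:
# response = model.chat(tokenizer, pixel_values, question, demo_config)
```

The two configurations differ in `do_sample`, `max_new_tokens`, `temperature`, and `top_p`; `num_beams` is the same in both.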