
Add minicpm-o and qwen2-vl to the list of supported multimodal models.

Open kseyhan opened this issue 1 year ago • 11 comments

Support for the Qwen2-VL and MiniCPM-o models would be nice. They have already been merged into the llava subproject of llama.cpp.

kseyhan avatar Jan 24 '25 19:01 kseyhan

+1

lelefontaa avatar Feb 04 '25 09:02 lelefontaa

Hmm, just tested again. Maybe it was me, or I pulled an outdated llama.cpp last time. minicpm-o seems to work with the "minicpm-v-2.6" chat handler.

kseyhan avatar Feb 05 '25 21:02 kseyhan
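For reference, here is a hedged sketch of the message payload that llama-cpp-python's multimodal chat handlers (such as the minicpm-v-2.6 one mentioned above) consume via `create_chat_completion()`: an OpenAI-style message list where the image travels as an `image_url` content part, either an http URL or a base64 data URI. The bytes below are a placeholder, not a real image, and the helper name is my own.

```python
# Sketch (assumption, not library code): building the OpenAI-style
# multimodal message list that llama-cpp-python's vision chat handlers
# accept. The image is embedded as a base64 data URI inside an
# "image_url" content part.
import base64

def image_to_data_uri(raw: bytes, mime: str = "image/png") -> str:
    # Encode raw image bytes as a data URI the chat handler can decode.
    return f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")

fake_png = b"\x89PNG\r\n\x1a\n"  # placeholder bytes, not a valid image
messages = [
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": image_to_data_uri(fake_png)}},
        {"type": "text", "text": "What is in this picture?"},
    ]},
]
```

Assuming the handler class name is correct (it may differ between versions), this list would then be passed to a `Llama` instance constructed with something like `chat_handler=MiniCPMv26ChatHandler(clip_model_path="...mmproj...")` via `llm.create_chat_completion(messages=messages)`.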

Yes, minicpm-o-2.6 works with the minicpm-v-2.6 chat handler. But Qwen2-VL does not seem to work with any existing chat handler.

I tried to use the example chat template from llama.cpp, but it still generates random characters...

la1ty avatar Feb 09 '25 07:02 la1ty


This is interesting. Could you give us the GGUF model urls you are using?

samkoesnadi avatar Feb 09 '25 08:02 samkoesnadi

@samkoesnadi I downloaded them from HuggingFace. Hope you have some good news.

la1ty avatar Feb 09 '25 08:02 la1ty

@samkoesnadi I tried my luck with Qwen2-VL-7B-Instruct-GGUF and tried almost every registered chat handler whose template includes the <|im_start|> and <|im_end|> tokens, and got the same results as @la1ty: random words in random languages as the reply.

I also tried to implement the chat template myself, but unfortunately failed, since I didn't really understand the Jinja template:

{
"chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
}

The template expects <|vision_start|><|image_pad|><|vision_end|> (which is very specific to this model and not registered in any chat handler so far), and to be honest I didn't really see where the base64-encoded string / image_url is supposed to go in this template.

kseyhan avatar Feb 09 '25 17:02 kseyhan
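To illustrate what the Jinja template above actually produces, here is a minimal pure-Python sketch (my own code, not the library's) that reproduces its output for a single user turn with one image and one text part. The key point, as I understand it: the rendered prompt only ever contains the placeholder tokens <|vision_start|><|image_pad|><|vision_end|>; the image itself is encoded by the mmproj/CLIP model and injected at the placeholder position, so the base64 data never appears in the text.

```python
# Sketch: reproduce what the Qwen2-VL chat template emits for one user
# turn containing an image part followed by a text part. Simplified from
# the Jinja template quoted above (add_vision_id and video parts omitted).
IM_START, IM_END = "<|im_start|>", "<|im_end|>"
VISION = "<|vision_start|><|image_pad|><|vision_end|>"

def format_qwen2_vl(messages, add_generation_prompt=True):
    out = []
    for i, msg in enumerate(messages):
        if i == 0 and msg["role"] != "system":
            # The template injects a default system prompt if none is given.
            out.append(f"{IM_START}system\nYou are a helpful assistant.{IM_END}\n")
        out.append(f"{IM_START}{msg['role']}\n")
        content = msg["content"]
        if isinstance(content, str):
            out.append(f"{content}{IM_END}\n")
        else:
            for part in content:
                if part.get("type") == "image" or "image_url" in part:
                    out.append(VISION)  # placeholder only; no image bytes here
                elif "text" in part:
                    out.append(part["text"])
            out.append(f"{IM_END}\n")
    if add_generation_prompt:
        out.append(f"{IM_START}assistant\n")
    return "".join(out)

prompt = format_qwen2_vl([
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
])
print(prompt)
```

If a chat handler renders anything other than this token sequence, or the vision embeddings are not substituted at the <|image_pad|> position, garbled output like the random characters described above seems plausible.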


@la1ty could you guys try the 2B and see if it works? That's the one I tested...

samkoesnadi avatar Feb 09 '25 17:02 samkoesnadi

@samkoesnadi Which chat handler did you use, if I may ask? The exact URL of the model you used would be useful as well.

kseyhan avatar Feb 09 '25 17:02 kseyhan

@kseyhan Yes, that's exactly what I experienced.

And I don't know if I made errors while compiling, but I found that text responses generated by Qwen2-VL-7b with llama-cpp-python v0.3.7 are mostly nonsense, which does not match the behavior of llama-cli.exe. Maybe I need to recompile it against the latest version of llama.cpp.

@samkoesnadi Yes, it works with llama-cli.exe and llama-qwen2vl-cli.exe in llama.cpp, though llama-qwen2vl-cli.exe seems to have an encoding problem with non-ASCII characters on Windows.

la1ty avatar Feb 10 '25 02:02 la1ty
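On the recompiling point: if the installed wheel vendors a llama.cpp revision that predates Qwen2-VL support, a source rebuild picks up the llama.cpp revision pinned by the latest release. This is the standard reinstall recipe from the llama-cpp-python README; the `CMAKE_ARGS` value is just an example (CUDA enabled) and can be dropped or changed for other backends.

```shell
# Rebuild llama-cpp-python from source so it compiles the llama.cpp
# revision vendored by the latest release, instead of reusing a cached
# or prebuilt wheel. CMAKE_ARGS is optional (CUDA shown as an example).
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
```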

Comments that may be off topic:

~~I tested Qwen2.5-VL-7B in several use cases, and it didn't seem to perform better than MiniCPM-O-2.6. If you want to build a vision application, don't expect too much from it. Just use another model or wait for the next version...~~

Qwen2.5-VL-32B has been released, and it is currently the best open-source OCR model. Support for the Qwen2.5-VL series is still valuable.

la1ty avatar Mar 19 '25 03:03 la1ty

Is there any update on supporting qwen2-vl?

hermeschen1116 avatar Aug 25 '25 08:08 hermeschen1116