Add MiniCPM-o and Qwen2-VL to the list of supported multimodal models.
Support for the Qwen2-VL and MiniCPM-o models would be nice. They have already been merged into the llava subproject of llama.cpp.
+1
Hmm, just tested again. Maybe it was me, or I pulled an outdated llama.cpp last time. MiniCPM-o seems to work with the "minicpm-v-2.6" chat handler.
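For anyone who wants to reproduce this, a minimal sketch of the wiring (the GGUF filenames are placeholders for whatever you downloaded):

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

# MiniCPM-o 2.6 reuses the MiniCPM-v-2.6 chat handler; point it at the
# vision projector (mmproj) GGUF that ships alongside the language model.
chat_handler = MiniCPMv26ChatHandler(clip_model_path="mmproj-model-f16.gguf")

llm = Llama(
    model_path="minicpm-o-2_6-Q4_K_M.gguf",  # placeholder filename
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image embedding tokens
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```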
Yes, minicpm-o-2.6 works with the minicpm-v-2.6 chat handler. But Qwen2-VL does not seem to work with any existing chat handler.
I tried to use the example chat template from llama.cpp, but it still generates random characters...
This is interesting. Could you give us the GGUF model URLs you are using?
@samkoesnadi I downloaded them from HuggingFace. Hope you have some good news.
@samkoesnadi I tried my luck with Qwen2-VL-7B-Instruct-GGUF and tried almost every registered chat handler that includes <|im_start|> and <|im_end|> tokens in its template, and got the same results as @la1ty: random words in random languages as the reply.
I also tried to implement the chat template myself, but unfortunately failed, since I didn't really understand the Jinja template:
```json
{
  "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
}
```
The template expects <|vision_start|><|image_pad|><|vision_end|> (which is unique to this model and not registered in any chat handler so far), and to be honest I didn't really see where the base64-encoded string / image_url is supposed to go in this template.
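In case it helps with experimenting: as far as I can tell, the multimodal handlers in llama-cpp-python all subclass Llava15ChatHandler and only override its CHAT_FORMAT Jinja template; the base class then finds each image URL in the rendered prompt and substitutes the image embeddings. So the image_url (or base64 data URI) is rendered as-is between the vision tokens. A rough, untested sketch (Qwen2VLChatHandler is my own name, not an existing class):

```python
from llama_cpp.llama_chat_format import Llava15ChatHandler

class Qwen2VLChatHandler(Llava15ChatHandler):
    # Hypothetical handler: the raw image URL (or base64 data URI) is
    # rendered between the Qwen2-VL vision tokens; Llava15ChatHandler
    # later splits the rendered prompt on these URLs and replaces each
    # one with the embeddings from the mmproj/CLIP model.
    CHAT_FORMAT = (
        "{% for message in messages %}"
        "{% if loop.first and message['role'] != 'system' %}"
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "{% endif %}"
        "<|im_start|>{{ message['role'] }}\n"
        "{% if message['content'] is string %}"
        "{{ message['content'] }}"
        "{% else %}"
        "{% for content in message['content'] %}"
        "{% if content['type'] == 'text' %}"
        "{{ content['text'] }}"
        "{% elif content['type'] == 'image_url' %}"
        "<|vision_start|>"
        "{% if content['image_url'] is string %}{{ content['image_url'] }}"
        "{% else %}{{ content['image_url']['url'] }}{% endif %}"
        "<|vision_end|>"
        "{% endif %}"
        "{% endfor %}"
        "{% endif %}"
        "<|im_end|>\n"
        "{% endfor %}"
        "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
    )
```

Even with the right template this may not be enough: Qwen2-VL uses M-RoPE for its vision tokens (which is presumably why llama.cpp ships a dedicated llama-qwen2vl-cli), so the generic llava eval path may still produce garbage regardless of the prompt. That could explain the random-language replies.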
@la1ty could you guys try the 2B and see if it works? That's the one I tested...
@samkoesnadi Which chat handler did you use, if I may ask? The exact URL of the model you used would be useful as well.
@kseyhan Yes, that's exactly what I experienced.
And I don't know if I made errors when compiling, but I found that the text responses generated by Qwen2-VL-7B with llama-cpp-python v0.3.7 are mostly nonsense, which does not match the behavior of llama-cli.exe. Maybe I need to recompile it against the latest version of llama.cpp.
@samkoesnadi Yes, it works with llama-cli.exe and llama-qwen2vl-cli.exe in llama.cpp, though llama-qwen2vl-cli.exe seems to have an encoding problem with non-ASCII characters on Windows.
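A quick way to narrow that down is to run a text-only prompt through llama-cpp-python and compare it with llama-cli output on the same GGUF. If this already produces nonsense with no image involved, the bundled llama.cpp build is the problem, not the chat handler. A minimal sketch (the path is a placeholder):

```python
from llama_cpp import Llama

# Text-only sanity check: no chat handler, no mmproj, just the language model.
llm = Llama(
    model_path="Qwen2-VL-7B-Instruct-Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```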
Comments that may be off-topic:
~~I tested Qwen2.5-VL-7B in several use cases and it seemed that it didn't perform better than MiniCPM-O-2.6. If you want to build a visual application, don't expect too much from it. Just use another model or wait for the next version...~~
Qwen2.5-VL-32B has been released and it is the best open source OCR model for now. Support for Qwen2.5-VL series is still valuable.
Is there any update on supporting Qwen2-VL?