Add MiniCPM-o and Qwen2-VL to the list of supported multimodal models.
Support for the Qwen2-VL and MiniCPM-o models would be nice. They have already been merged into the llava subproject of llama.cpp.
+1
Hmm, just tested again. Maybe it was me, or I pulled an outdated llama.cpp last time. MiniCPM-o seems to work with the "minicpm-v-2.6" chat handler.
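For anyone who wants to reproduce this, a minimal sketch of the wiring (the GGUF filenames are placeholders for whatever you downloaded):

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

# MiniCPM-o 2.6 reuses the MiniCPM-v-2.6 chat handler; point it at the
# vision projector (mmproj) GGUF that ships alongside the language model.
chat_handler = MiniCPMv26ChatHandler(clip_model_path="mmproj-model-f16.gguf")

llm = Llama(
    model_path="minicpm-o-2_6-Q4_K_M.gguf",  # placeholder filename
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for the image embedding tokens
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```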
Yes, minicpm-o-2.6 works with the minicpm-v-2.6 chat handler. But Qwen2-VL does not seem to work with any existing chat handler.
I tried to use the example chat template from llama.cpp, but it still generates random characters...
This is interesting. Could you give us the GGUF model URLs you are using?
@samkoesnadi I downloaded them from HuggingFace. Hope you have some good news.
@samkoesnadi I tried my luck with Qwen2-VL-7B-Instruct-GGUF and tried almost every registered chat handler that includes <|im_start|> and <|im_end|> tokens in its template, and got the same results as @la1ty: random words in random languages as the reply.
I also tried to implement the chat template myself, but unfortunately failed, since I didn't really understand the Jinja template:
```json
{
  "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
}
```
The template expects <|vision_start|><|image_pad|><|vision_end|> (which is unique to this model and not registered in any chat handler so far), and to be honest I didn't really see where the base64-encoded string / image_url is supposed to go in this template.
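In case it helps with experimenting: as far as I can tell, the multimodal handlers in llama-cpp-python all subclass Llava15ChatHandler and only override its CHAT_FORMAT Jinja template; the base class then finds each image URL in the rendered prompt and substitutes the image embeddings. So the image_url (or base64 data URI) is rendered as-is between the vision tokens. A rough, untested sketch (Qwen2VLChatHandler is my own name, not an existing class):

```python
from llama_cpp.llama_chat_format import Llava15ChatHandler

class Qwen2VLChatHandler(Llava15ChatHandler):
    # Hypothetical handler: the raw image URL (or base64 data URI) is
    # rendered between the Qwen2-VL vision tokens; Llava15ChatHandler
    # later splits the rendered prompt on these URLs and replaces each
    # one with the embeddings from the mmproj/CLIP model.
    CHAT_FORMAT = (
        "{% for message in messages %}"
        "{% if loop.first and message['role'] != 'system' %}"
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "{% endif %}"
        "<|im_start|>{{ message['role'] }}\n"
        "{% if message['content'] is string %}"
        "{{ message['content'] }}"
        "{% else %}"
        "{% for content in message['content'] %}"
        "{% if content['type'] == 'text' %}"
        "{{ content['text'] }}"
        "{% elif content['type'] == 'image_url' %}"
        "<|vision_start|>"
        "{% if content['image_url'] is string %}{{ content['image_url'] }}"
        "{% else %}{{ content['image_url']['url'] }}{% endif %}"
        "<|vision_end|>"
        "{% endif %}"
        "{% endfor %}"
        "{% endif %}"
        "<|im_end|>\n"
        "{% endfor %}"
        "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
    )
```

Even with the right template this may not be enough: Qwen2-VL uses M-RoPE for its vision tokens (which is presumably why llama.cpp ships a dedicated llama-qwen2vl-cli), so the generic llava eval path may still produce garbage regardless of the prompt. That could explain the random-language replies.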
@la1ty could you guys try the 2B and see if it works? That's the one I tested...
@samkoesnadi Which chat handler did you use, if I may ask? The exact URL of the model you used would be useful as well.
@kseyhan Yes, that's exactly what I experienced.
And I don't know if I made errors when compiling, but I found that the text responses generated by Qwen2-VL-7B with llama-cpp-python v0.3.7 are mostly nonsense, which does not match the behavior of llama-cli.exe. Maybe I need to recompile it against the latest version of llama.cpp.
@samkoesnadi Yes, it works with llama-cli.exe and llama-qwen2vl-cli.exe in llama.cpp, though llama-qwen2vl-cli.exe seems to have an encoding problem with non-ASCII characters on Windows.
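A quick way to narrow that down is to run a text-only prompt through llama-cpp-python and compare it with llama-cli output on the same GGUF. If this already produces nonsense with no image involved, the bundled llama.cpp build is the problem, not the chat handler. A minimal sketch (the path is a placeholder):

```python
from llama_cpp import Llama

# Text-only sanity check: no chat handler, no mmproj, just the language model.
llm = Llama(
    model_path="Qwen2-VL-7B-Instruct-Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```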
Comments that may be off-topic:
~~I tested Qwen2.5-VL-7B in several use cases and it seemed that it didn't perform better than MiniCPM-O-2.6. If you want to build a visual application, don't expect too much from it. Just use another model or wait for the next version...~~
Qwen2.5-VL-32B has been released and it is the best open source OCR model for now. Support for Qwen2.5-VL series is still valuable.
Is there any update on supporting Qwen2-VL?