feat: Add Gemma3 chat handler (#1976)
Added the Gemma3 chat handler, fixed the image embedding, and added support for multiple images.
Included the following llama.cpp functions and structures (a rough ctypes sketch follows the list):
- clip_image_load_from_bytes
- clip_image_batch_encode
- clip_image_preprocess
- clip_image_f32_batch_init
- clip_image_f32_batch_free
- clip_image_u8_init
- clip_image_u8_free
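For reference, a rough sketch of how functions like these can be bound with plain ctypes; the C signatures in the comments are the ones from llama.cpp's clip.h around the time of this PR (verify against your checkout), the PR itself goes through llama_cpp's existing binding layer rather than raw ctypes, and the shared-library path below is only a placeholder:

import ctypes

# Hypothetical, simplified binding sketch (not the actual PR code)
_lib = ctypes.CDLL("path/to/libllava.so")  # placeholder path

# struct clip_image_u8 * clip_image_u8_init(void);
_lib.clip_image_u8_init.restype = ctypes.c_void_p
_lib.clip_image_u8_init.argtypes = []

# void clip_image_u8_free(struct clip_image_u8 * img);
_lib.clip_image_u8_free.restype = None
_lib.clip_image_u8_free.argtypes = [ctypes.c_void_p]

# bool clip_image_load_from_bytes(const unsigned char * bytes, size_t bytes_length, struct clip_image_u8 * img);
_lib.clip_image_load_from_bytes.restype = ctypes.c_bool
_lib.clip_image_load_from_bytes.argtypes = [ctypes.c_char_p, ctypes.c_size_t, ctypes.c_void_p]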
Usage (Current version, after Apr 4 2025):
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

chat_handler = Gemma3ChatHandler(clip_model_path="path/to/mmproj")
llama = Llama(
    model_path="path/to/model",
    chat_handler=chat_handler,
    n_ctx=1024,  # n_ctx should be increased to accommodate the image embedding
)

messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Please describe this image'},
            {'type': 'image_url', 'image_url': 'https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/tests/fixtures/tests_samples/COCO/000000039769.png'},
        ]
    }
]

output = llama.create_chat_completion(
    messages,
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=200,
)

print(output['choices'][0]['message']['content'])
Message content format change:
- {'type': 'image', 'image': ...}
+ {'type': 'image_url', 'image_url': ...}
Test Results:
- Passed local environment tests: Python 3.12, with unsloth/gemma-3-4b-it-GGUF, unsloth/gemma-3-12b-it-GGUF, unsloth/gemma-3-27b-it-GGUF, and bartowski/google_gemma-3-12b-it-GGUF
Compatibility:
- Fully backward compatible with existing interfaces.
- Maintains original APIs while adding new options and interfaces.
I've been using it a bit and it works nicely. I had to figure out the message structure, but maybe that's normal for different chat handlers; I'm not that familiar with llama-cpp.
"type": "image",
"image": {
"url": "https://image.com/img.jpg",
}
I was used to "image_url" in both places where "image" is used now.
How would that work with a local image?
Sorry, I initially didn't modify the original Gemma3 chat template, so I used "type": "image". I have now changed the message format to be compatible with the OpenAI API, just like the other chat handlers.
Here is a full example:
from pathlib import Path

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler


def image_to_base64_uri(image: bytes | str):
    import base64
    import urllib.request as request

    if isinstance(image, bytes):
        data = base64.b64encode(image).decode('utf-8')
    else:
        with request.urlopen(image) as f:
            data = base64.b64encode(f.read()).decode('utf-8')
    return f'data:image/png;base64,{data}'


chat_handler = Gemma3ChatHandler(clip_model_path='path/to/mmproj')
llama = Llama(
    model_path='path/to/model',
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)

messages = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'please compare these pictures'},
            {'type': 'image_url', 'image_url': 'https://xxxx/img1.jpg'},
            {'type': 'image_url', 'image_url': {'url': 'https://xxxx/img2.png'}},
            {'type': 'image_url', 'image_url': image_to_base64_uri(Path('path/to/img3.jpg').read_bytes())},
            {'type': 'image_url', 'image_url': {'url': image_to_base64_uri(Path('path/to/img4.png').read_bytes())}},
            {'type': 'text', 'text': 'and then tell me which one looks the best'},
        ]
    }
]

output = llama.create_chat_completion(
    messages,
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=500,
    stream=True,
)

for chunk in output:
    delta = chunk['choices'][0]['delta']
    if 'role' in delta:
        print(delta['role'], end=':\n')
    elif 'content' in delta:
        print(delta['content'], end='')

llama._sampler.close()
llama.close()
Bump on this, thanks for your work! Gemma3 is a great model to have support for; I'm waiting on it!
Hey @kossum, just wondering, does this handler support function calling? I ask because the handler for llava1.5 supports both multimodal (vision) and tool calling at once; since Gemma3 also has tool calling capabilities, it would be great to have both in a single handler!
Hello @joaojhgs, gemma3 (especially the 12b and 27b versions) has strong instruction-following abilities and can generate structured function call outputs through well-designed prompts.
But unlike gpt4 or claude, gemma3 does not have built-in support for tool call tokens or json schema enforcement. That means:
- No built-in tool use markers: gemma3 does not automatically identify or tag tool usage.
- Requires explicit prompt design: you need to clearly define function names, parameters, and output format in the prompt.
- Lacks standardized templates: currently, gemma3’s chat_template does not include tool use structures.
So to implement function calling with gemma3, you must rely on carefully designed prompts to guide the model in producing the correct format.
Simple example:
import json

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler

chat_handler = Gemma3ChatHandler(clip_model_path='path/to/mmproj')
llama = Llama(
    model_path='path/to/model',
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)

def analyze_image(image_id: str, description: str):
    print('image_id:', image_cache.get(image_id))
    print('description:', description)
    ...

image_cache = {'img_01': 'https://xxxx/img_01.jpg'}
function_table = {'analyze_image': analyze_image}

# input arg1
image_id = 'img_01'
# input arg2
question = f'Here is the image with ID `{image_id}`. Please analyze it.'

output = llama.create_chat_completion(
    [
        {
            'role': 'system',
            'content': '''You can call the following function:

- analyze_image(image_id: str, description: str)

You will be shown an image. First, analyze and describe its content in detail.
Then, return a function call with:
- the assigned image_id (provided in the input)
- a description of what the image shows (your own analysis)

Respond only with a JSON (without code blocks) function call like:

{
  "function": "analyze_image",
  "arguments": {
    "image_id": "<image id>",
    "description": "<description of the image>"
  }
}
'''
        },
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': question},
                {'type': 'image_url', 'image_url': image_cache[image_id]},
            ]
        }
    ],
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=500,
)

data = json.loads(output['choices'][0]['message']['content'])
result = function_table[data['function']](**data['arguments'])
...
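Since gemma3 enforces no schema, that final json.loads can fail on malformed output; a defensive variant of the last two lines (illustrative only) could be:

# Defensive variant: the model may emit malformed JSON or an unknown function name
try:
    data = json.loads(output['choices'][0]['message']['content'])
    result = function_table[data['function']](**data['arguments'])
except (json.JSONDecodeError, KeyError, TypeError) as e:
    print('Could not execute tool call:', e)
    result = None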
Naturally, if multimodal capabilities aren’t needed, this chat handler can be omitted.
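For example, a plain text-only call needs neither the mmproj file nor the handler; a minimal sketch (model path is a placeholder, relying on the chat template embedded in the GGUF):

from llama_cpp import Llama

llama = Llama(model_path='path/to/model', n_ctx=2048)  # no clip_model_path / chat_handler
output = llama.create_chat_completion(
    [{'role': 'user', 'content': 'Give me three facts about the Gemma model family.'}],
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=200,
)
print(output['choices'][0]['message']['content'])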
Thanks, I didn't know about that!
@abetlen I've been waiting for this to be merged for some time. I'm curious, are you still actively maintaining this repo? Thanks!
Say Hey- I added your code to my venv, and when running your example I received this error:
Traceback (most recent call last):
File "d:\Foundary\Gemma-3_llama.py", line 57, in
So, seeing the "OSError" label, I wanted to ask: have you run your code under Windows 11?
Hello @Domino9752, thanks for testing! The issue happens because the author updated the llama.cpp library in the current 0.3.9 version, but the corresponding changes for the llava part haven't been made yet. Could you please try rolling back to version 0.3.8 and see if it works? Alternatively, you can use my fork for now. I'll update it soon with the necessary fixes.
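For reference, the rollback is just a pinned install (add your usual build flags if you compile with GPU support):

pip install "llama-cpp-python==0.3.8" --force-reinstall --no-cache-dir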
kossum-
Thanks for your help. I changed over to your branch:
python -m pip install git+https://github.com/kossum/llama-cpp-python@gemma3-fix --no-cache-dir --force-reinstall --upgrade --config-settings="cmake.args=-DGGML_CUDA=on"
...and now it runs. I previously had the syntax shown in the April 4th message using "image_url". Upon reading llama_chat_format.py, I realized the correct syntax for the messages (with @gemma3-fix) is:
path_to_image = r"D:\FLUX1-dev\inputs\Clipped\0515-2200-6-2_224.png"
messages = [
{
'role': 'user',
'content': [
{'type': 'text', 'text': "Please describe this image in great detail, OK?"},
{'type': 'image', 'url': image_to_base64_uri(Path(path_to_image).read_bytes())},
]
}
]
Intel i7 14700K / RTX3090

Llama.generate: 528 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =     46.10 ms
llama_perf_context_print: prompt eval time =   1241.25 ms /   530 tokens (    2.34 ms per token,   426.99 tokens per second)
llama_perf_context_print:        eval time =  25365.96 ms /   684 runs   (   37.08 ms per token,    26.97 tokens per second)
llama_perf_context_print:       total time =  26286.61 ms /  1214 tokens
Hi @kossum, thank you for the work. I've tested it, and it works well. Here's the test code:
- Install from your branch: pip install git+https://github.com/kossum/llama-cpp-python.git@main
- Then run the test:
import os
import requests
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Gemma3ChatHandler
# URLs for the model weights
MODEL_URL = "https://huggingface.co/vinimuchulski/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/gemma-3-4b-it-q4_0.gguf?download=true"
MMPROJ_URL = "https://huggingface.co/vinimuchulski/gemma-3-4b-it-qat-q4_0-gguf/resolve/main/mmproj-model-f16-4B.gguf?download=true"
MODEL_FILE = "gemma-3-4b-it-q4_0.gguf"
MMPROJ_FILE = "mmproj-model-f16-4B.gguf"
def download_file(url, local_path):
    if not os.path.exists(local_path):
        print(f"Downloading {local_path} ...")
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(local_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        print(f"Downloaded {local_path}")
    else:
        print(f"{local_path} already exists, skipping download.")

# Download the weights if they don't exist
download_file(MODEL_URL, MODEL_FILE)
download_file(MMPROJ_URL, MMPROJ_FILE)

# Initialize the multimodal chat handler
chat_handler = Gemma3ChatHandler(clip_model_path=MMPROJ_FILE)

# Load the Gemma 3 model with multimodal support
llm = Llama(
    model_path=MODEL_FILE,
    chat_handler=chat_handler,
    n_ctx=2048,  # You can increase or decrease as needed
)

# Sample inference: describe an image from a URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please describe this image."},
            {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/tests/fixtures/tests_samples/COCO/000000039769.png"}},
        ]
    }
]

output = llm.create_chat_completion(
    messages,
    stop=['<end_of_turn>', '<eos>'],
    max_tokens=200,
)

print("Model output:", output['choices'][0]['message']['content'])
There is one issue I want to ask for help with - the output speed is quite slow. On the CPU kernel:
Llama.generate: 299 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =    458.17 ms
llama_perf_context_print: prompt eval time = 363380.98 ms /   301 tokens ( 1207.25 ms per token,     0.83 tokens per second)
llama_perf_context_print:        eval time =  97226.48 ms /   199 runs   (  488.58 ms per token,     2.05 tokens per second)
llama_perf_context_print:       total time =  98303.12 ms /   500 tokens
On GPU kernel:
Llama.generate: 299 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =    401.93 ms
llama_perf_context_print: prompt eval time = 285504.85 ms /   301 tokens (  948.52 ms per token,     1.05 tokens per second)
llama_perf_context_print:        eval time =  80909.79 ms /   199 runs   (  406.58 ms per token,     2.46 tokens per second)
llama_perf_context_print:       total time =  81813.53 ms /   500 tokens
When I test the same model in LM Studio, the speed is much faster: around 4-9 tokens/second on CPU, and with the GPU it can go as high as 60 TPS. Why is there a difference between llama_cpp_python and LM Studio for the same model? Do you have any suggestions to make the output speed faster?
Hi @xia0nan, thanks for your feedback.
If you find that inference speed is slow, please note that you can use the n_gpu_layers parameter to specify how many transformer layers should be offloaded to the GPU. For example:
llm = llama.Llama(
    model_path=MODEL_PATH,
    chat_handler=Gemma3ChatHandler(clip_model_path=MMPROJ_PATH),
    n_gpu_layers=48,
    n_ctx=1024,
)
However, in order to use the GPU, you need to build llama-cpp-python with the appropriate backend. Please refer to the Supported Backends section in the README for detailed instructions on enabling GPU support during installation.
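For example, the CUDA backend described there is typically enabled at install time like this (adjust the CMake flag for other backends):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir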
Also, LM Studio may use GPU acceleration by default or include additional optimizations, which can explain the speed difference.
Please note that GPU support and related build issues are not directly within the scope of this chat handler. If you encounter any problems when compiling with GPU support, feel free to open a separate issue. For more general installation or backend-related questions, you may also want to refer to the main llama-cpp-python repository.
Hey @kossum, just out of curiosity, is it possible to port this into my project without waiting for it to be merged or using your branch? I mean making a class override; I've done that before in cases where I couldn't wait for a third-party lib merge, but this one has those ctypes functions that look like compiled stuff I'd need support for from the lib directly. Could I add those ctypes directly into my project as well?
Also, as I have been trying to load the mmproj model, the loading function fails with complaints about the mmproj file missing some required keys, such as general.description, clip.has_text_encoder, etc. I'm using the unsloth 4b model and their F16 mmproj file; am I perhaps missing something?
could I add those ctypes directly into my project as well?
You can. The trick is you have to duplicate the ctypes file into your project, override the init method from Llava and import from your local ctypes when instantiating.
Thanks for the PR, @kossum - have been looking into this for a few hours, and was on the verge of taking a stab myself, when I found you had already done it more than a month ago (and probably better than I would have managed)!
While we're waiting for @abetlen to take a look, do you mind rebasing your fork against the tip of main / 0.3.14? I would just use your fork, but we've been using the qwen2.5-vl format which has been subsequently added, and it would be sad to lose support for it.
I'm also happy to take a stab at a PR, along the lines of your gemma-for-0.3.9, but am waiting for a new local machine to arrive to properly test it out.
Hi @Gordonei , thanks for your message and for the detailed feedback! I haven’t logged into github for a while, so I wasn’t aware of the recent updates. I’ll rebase and update the PR to support the latest version soon. Thanks for your patience.
No worries, thanks for considering taking it on!
Updated to 0.3.14. Since llama.cpp’s mtmd module now covers the image embedding for gemma3, I’ve removed my previous implementation and am now only keeping the chat_template.
any updates on this?
Does it work?
hi all! useful PR here :) any updates or time to merge?
You can try this standalone script, tested with 0.3.16:
import os, json, base64, argparse, logging
from typing import List
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
class Gemma3ChatHandler(Llava15ChatHandler):
    DEFAULT_SYSTEM_MESSAGE = None

    CHAT_FORMAT = (
        "{{ bos_token }}"
        "{% for message in messages %}"
        "{% if message.role == 'system' %}"
        "<start_of_turn>user\n{{ message.content }}<end_of_turn>\n"
        "<start_of_turn>model\nUnderstood.<end_of_turn>\n"
        "{% endif %}{% endfor %}"
        "{% for message in messages %}{% if message.role != 'system' %}"
        "<start_of_turn>{{ message.role }}\n"
        "{% if message.content is string %}{{ message.content }}"
        "{% else %}{% for c in message.content %}"
        "{% if c.type == 'text' and c.text %}{{ c.text }}{% endif %}"
        "{% if c.type == 'image_url' %}{{ c.image_url.url }}{% endif %}"
        "{% endfor %}{% endif %}"
        "<end_of_turn>\n"
        "{% endif %}{% endfor %}"
        "{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"
    )
def image_to_base64_uri(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    ext = os.path.splitext(path)[1].lower()
    mime = {"jpg": "image/jpeg", "jpeg": "image/jpeg", "png": "image/png", "gif": "image/gif", "webp": "image/webp"}.get(ext.lstrip("."), "image/png")
    return f"data:{mime};base64,{b64}"
class Gemma3Vision:
    def __init__(
        self,
        repo_id="unsloth/gemma-3-4b-it-GGUF",
        filename="gemma-3-4b-it-Q4_K_M.gguf",
        mmproj_repo="unsloth/gemma-3-4b-it-GGUF",
        mmproj_file="mmproj-F16.gguf",
        n_gpu_layers=63,
    ):
        self.repo_id = repo_id
        self.filename = filename
        self.mmproj_repo = mmproj_repo
        self.mmproj_file = mmproj_file
        self.n_gpu_layers = n_gpu_layers
        self.llm = None

    def _load(self):
        if self.llm:
            return
        from huggingface_hub import hf_hub_download
        mmproj_path = hf_hub_download(self.mmproj_repo, self.mmproj_file, resume_download=True)
        chat_handler = Gemma3ChatHandler(clip_model_path=mmproj_path, verbose=False)
        self.llm = Llama.from_pretrained(
            repo_id=self.repo_id,
            filename=self.filename,
            chat_handler=chat_handler,
            n_ctx=8192,
            n_gpu_layers=self.n_gpu_layers,
            n_batch=512,
            verbose=False,
        )

    def describe(self, image_path: str, prompt: str) -> str:
        self._load()
        msg = [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_to_base64_uri(image_path)}},
            ],
        }]
        res = self.llm.create_chat_completion(
            messages=msg,
            stop=["<end_of_turn>", "<start_of_turn>"],
            max_tokens=500,
            temperature=0.3,
            top_p=0.9,
        )
        # supports some llama_cpp variants that return "out" instead of "choices"
        key = "out" if "out" in res else "choices"
        return res[key][0]["message"]["content"]
def main():
    p = argparse.ArgumentParser(description="Describe a single image with Gemma 3 vision")
    p.add_argument("image", help="Path to image")
    p.add_argument("--prompt", default="Describe this image in detail.")
    p.add_argument("--repo-id", default="unsloth/gemma-3-4b-it-GGUF")
    p.add_argument("--filename", default="gemma-3-4b-it-Q4_K_M.gguf")
    p.add_argument("--mmproj-repo", default="unsloth/gemma-3-4b-it-GGUF")
    p.add_argument("--mmproj-file", default="mmproj-F16.gguf")
    p.add_argument("-g", "--gpu-layers", type=int, default=63)
    p.add_argument("-o", "--output", help="Write JSON result to file")
    args = p.parse_args()

    if not os.path.exists(args.image):
        logging.error(f"Image not found: {args.image}")
        return

    model = Gemma3Vision(
        repo_id=args.repo_id,
        filename=args.filename,
        mmproj_repo=args.mmproj_repo,
        mmproj_file=args.mmproj_file,
        n_gpu_layers=args.gpu_layers,
    )
    desc = model.describe(args.image, args.prompt)
    print(f"Image: {args.image}\nDescription: {desc}")

    if args.output:
        with open(args.output, "w", encoding="utf-8") as f:
            json.dump({"image": args.image, "prompt": args.prompt, "description": desc}, f, indent=2)
        logging.info(f"Wrote {args.output}")

if __name__ == "__main__":
    main()
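For what it's worth, the script is invoked like this (the filename is whatever you save it as; all arguments map to the argparse options above):

python gemma3_describe.py path/to/image.jpg --prompt "What is in this picture?" -g 63 -o result.json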
Is this going to be merged?
@geocine Wondering what the performance is like - roughly how many input tokens per image, and what patch size?
Is this going to be merged?
@dchatel Months later, but perhaps still useful: you can try this for local images. However, I have overridden the method that supports URLs, so it should work only for local images.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler


class Gemma3ChatHandler(Llava15ChatHandler):
    DEFAULT_SYSTEM_MESSAGE = None

    CHAT_FORMAT = (
        "{% if messages[0]['role'] == 'system' %}"
        "{% if messages[0]['content'] is string %}"
        "{% set first_user_prefix = messages[0]['content'] + '\n\n' %}"
        "{% else %}"
        "{% set first_user_prefix = messages[0]['content'][0]['text'] + '\n\n' %}"
        "{% endif %}"
        "{% set loop_messages = messages[1:] %}"
        "{% else %}"
        "{% set first_user_prefix = \"\" %}"
        "{% set loop_messages = messages %}"
        "{% endif %}"
        "{% for message in loop_messages %}"
        "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
        "{{ raise_exception(\"Conversation roles must alternate user/assistant/user/assistant/...\") }}"
        "{% endif %}"
        "{% if (message['role'] == 'assistant') %}"
        "{% set role = \"model\" %}"
        "{% else %}"
        "{% set role = message['role'] %}"
        "{% endif %}"
        "{{ '<start_of_turn>' + role + '\n' + (first_user_prefix if loop.first else \"\") }}"
        "{% if message['content'] is string %}"
        "{{ message['content'] | trim }}"
        "{% elif message['content'] is iterable %}"
        "{% for item in message['content'] %}"
        "{% if item['type'] == 'image_url' and item['image_url'] is string %}"
        "{{ '\n\n' + item['image_url'] + '\n\n' }}"
        "{% elif item['type'] == 'image_url' and item['image_url'] is mapping %}"
        "{{ '\n\n' + item['image_url']['url'] + '\n\n' }}"
        "{% elif item['type'] == 'text' %}"
        "{{ item['text'] | trim }}"
        "{% endif %}"
        "{% endfor %}"
        "{% else %}"
        "{{ raise_exception(\"Invalid content type\") }}"
        "{% endif %}"
        "{{ '<end_of_turn>\n' }}"
        "{% endfor %}"
        "{% if add_generation_prompt %}"
        "{{ '<start_of_turn>model\n' }}"
        "{% endif %}"
    )

    def load_image(self, image_url: str) -> bytes:
        return self._load_image(image_url)

    @staticmethod
    def _load_image(image_url: str) -> bytes:
        with open(image_url, "rb") as f:
            return f.read()
chat_handler = Gemma3ChatHandler(clip_model_path="path/to/mmproj")

# Load the model once and reuse it
llm = Llama(
    model_path="path/to/model",
    n_ctx=4096,
    n_gpu_layers=-1,
    chat_handler=chat_handler,
    verbose=False,
)

def image_text_to_text(prompt: str, image_path: str, max_tokens: int = 512):
    response = llm.create_chat_completion(
        messages=[
            {
                'role': 'user',
                'content': [
                    {'type': 'text', 'text': prompt},
                    {'type': 'image_url', 'image_url': image_path},
                ]
            }
        ],
        max_tokens=max_tokens,
        temperature=1,
    )
    return response["choices"][0]["message"]["content"]
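A quick usage sketch of the helper above (the model, mmproj, and image paths are placeholders):

print(image_text_to_text("Describe this image in one paragraph.", "path/to/local/image.png"))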