Some LlamaCpp models produce incoherent output in Guidance but work fine when using llama-cpp-python directly
The bug
GGUFs of GLM-4.5-Air and Qwen3-32B loaded via LlamaCpp produce nonsense in Guidance, but the same files work fine when running inference directly with llama.cpp. Gemma 3 27B works as expected with Guidance. I've also tried inference on CPU and recompiling llama.cpp to use Vulkan, to rule out ROCm as the cause.
To Reproduce
from guidance.chat import ChatTemplate
from guidance.models import LlamaCpp
from guidance import system, user, assistant, gen
sampling_params = {
"top_p": 0.95,
"top_k": 40,
"min_p": 0.0,
}
class GLM45ChatTemplate(ChatTemplate):
    # Minimal chat template for GLM-4.5: each role is opened with <|role|>
    # and no end-of-role marker is emitted.
    def get_role_start(self, role_name):
        match role_name:
            case "system" | "user" | "assistant":
                return f"<|{role_name}|>\n"
            case _:
                raise ValueError(f'Unrecognized role_name "{role_name}"')

    def get_role_end(self, role_name=None):
        return ""
glm = LlamaCpp(
    "GLM-4.5-Air-UD-IQ2_XXS.gguf",
    chat_template=GLM45ChatTemplate,
    n_gpu_layers=0,
    n_ctx=8192,
    sampling_params=sampling_params,
    verbose=True,
)
llm = glm
with system():
    llm += "You are a helpful assistant."
with user():
    llm += "Tell me a joke./nothink"
with assistant():
    llm += "<think></think>" + gen(max_tokens=32, temperature=0.6)

print(llm)
# <|system|>
# You are a helpful assistant.<|user|>
# Tell me a joke./nothink<|assistant|>
# <think></think> user is is a comedian. The user asks the assistant for a joke. The assistant is a function that takes user provides assistant for the joke. The user asks
llama-cpp-python output for comparison:
In [3]: output = llm(
...: """<|system|>
...: You are a helpful assistant<|user|>
...: Tell me a joke./nothink<|assistant|>
...: """, # Prompt
...: max_tokens=32,
...: stop=["<|user|>"],
...: echo=True,
...: temperature=0.6,
...: top_p=0.95,
...: min_p=0.0,
...: top_k=40,
...: )
...: print(output)
llama_perf_context_print: load time = 2346.95 ms
llama_perf_context_print: prompt eval time = 2346.73 ms / 17 tokens ( 138.04 ms per token, 7.24 tokens per second)
llama_perf_context_print: eval time = 1888.30 ms / 15 runs ( 125.89 ms per token, 7.94 tokens per second)
llama_perf_context_print: total time = 4240.59 ms / 32 tokens
llama_perf_context_print: graphs reused = 14
{'id': 'cmpl-dc889d36-6b84-488a-99b7-7380b34bdc80', 'object': 'text_completion', 'created': 1759719671, 'model': '/mnt/data/models/GLM-4.5-Air-UD-IQ2_XXS.gguf', 'choices': [{'text': "<|system|>\nYou are a helpful assistant<|user|>\nTell me a joke./nothink<|assistant|>\n<think></think>Why don't scientists trust atoms?\n\nBecause they make up everything!", 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 17, 'completion_tokens': 15, 'total_tokens': 32}}
# <|system|>
# You are a helpful assistant<|user|>
# Tell me a joke./nothink<|assistant|>
# <think></think>Why don't scientists trust atoms?
#
# Because they make up everything!
System info (please complete the following information):
- OS (e.g. Ubuntu, Windows 11, Mac OS, etc.): Arch Linux
- Guidance Version (guidance.__version__): 0.3.0
- llama-cpp-python version: 0.3.16, tried both -DGGML_VULKAN=1 and -DGGML_HIP=1
Thank you for filing this issue. Unfortunately, this is something that can happen with particular models. We will try to investigate to see if there is a straightforward fix.
In GLM-4.5-Air's case, it seems the issue was that the special tokens [gMASK]<sop> weren't being added to the beginning of the context.
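If that's the cause, one possible workaround is to prepend that prefix manually before the first role block. A minimal, untested sketch (it assumes Guidance tokenizes the prepended string to the actual special tokens rather than as literal text, which would need checking):

# Hypothetical workaround: manually add the [gMASK]<sop> prefix that
# llama.cpp inserts on its own when the model is run directly.
llm = glm + "[gMASK]<sop>"
with system():
    llm += "You are a helpful assistant."
with user():
    llm += "Tell me a joke./nothink"
with assistant():
    llm += "<think></think>" + gen(max_tokens=32, temperature=0.6)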
@feffy380 were you able to work around this while using guidance?