Misc. bug: llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
Name and Version
$./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5329 (611aa914)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
llama-cli \
--log-file /tmp/llamacpp-Qwen3-30B-A3B-Q8_K_XL.log \
--hf-repo unsloth/Qwen3-30B-A3B-GGUF:Q8_K_XL \
--override-tensor '([0-9]+).ffn_.*_exps.=CPU' \
--n-gpu-layers 48 \
--jinja \
--cache-type-k q8_0 \
--ctx-size 32768 \
--samplers "top_k;dry;min_p;temperature;top_p" \
--min-p 0.005 \
--top-p 0.97 \
--top-k 40 \
--temp 0.7 \
--dry-multiplier 0.7 \
--dry-allowed-length 4 \
--dry-penalty-last-n 2048 \
--presence-penalty 0.05 \
--frequency-penalty 0.005 \
--repeat-penalty 1.01 \
--repeat-last-n 16 \
--verbose \
--file generic-prompt-for-testing-1906words.txt
Problem description & steps to reproduce
The log file of the output, together with (what I hope is) all the relevant information, can be found in this ephemeral repo I put up for this bug report: https://github.com/bjodah/bug-reproducer-llamacpp-assert-triggering/tree/main
It might very well be that I'm doing something awfully wrong here, but since an assert is being triggered, I figured you might be interested in a bug report.
I first observed this error using llama-server on my laptop (Ubuntu 24.04, GeForce 1050 Mobile), but everything in this bug report was reproduced on a more modern system (Debian, GeForce RTX 3090).
First Bad Commit
Qwen 3 support is pretty recent, so I haven't figured out what the relevant oldest commit for a bisection would be.
Relevant log output
/... lots of output, see log file in repo linked in issue description .../
eval: [ 'G':38 ]
Gn_past = 2620
/home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
/home/bjorn/.gdbinit:2: Error in sourced command file:
/home/bjorn/dotfiles/per-file/.gdbinit:22: Error in sourced command file:
Scripting in the "Python" language is not supported in this copy of GDB.
ptrace: Operation not permitted.
No stack.
The program is not being run.
...I should have added a --seed flag, but the issue is reproducible for me with all seeds I've tried so far.
The issue has to do with --dry-allowed-length 4:
...
Now finish your task according to taskDefinition, only write the poem, add no commentary.
assistant
GGGG/home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
If I adjust this to --dry-allowed-length 9, we see nine capital G's before the assert:
...
Now finish your task according to taskDefinition, only write the poem, add no commentary.
assistant
GGGGGGGGG/home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
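For what it's worth, my mental model of why the number of G's tracks --dry-allowed-length is that DRY only starts penalizing a token once it would extend a repeated sequence beyond the allowed length, and the penalty then grows exponentially. Below is a small sketch of that commonly described formula (my own illustration under that assumption, not code taken from llama.cpp; dry_base is assumed to be at its default of 1.75 since I don't override it):

#include <cmath>
#include <cstdio>

// Illustrative only: penalty subtracted from the logit of a token that would
// extend a repeated sequence of length repeat_len (commonly described DRY rule).
static float dry_penalty(int repeat_len, int allowed_length, float multiplier, float base) {
    if (repeat_len < allowed_length) {
        return 0.0f; // repeats up to the allowed length go unpenalized
    }
    return multiplier * std::pow(base, (float) (repeat_len - allowed_length));
}

int main() {
    // with --dry-multiplier 0.7 and --dry-allowed-length 4 (resp. 9), the first
    // few repeats of 'G' are free, then the penalty ramps up quickly
    for (int len = 1; len <= 10; ++len) {
        std::printf("repeat_len=%2d penalty=%.3f\n", len, dry_penalty(len, 4, 0.7f, 1.75f));
    }
    return 0;
}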
I'm seeing this bug as well, and I'm not passing in --dry-allowed-length 4.
main: server is listening on http://0.0.0.0:8089 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 16128, n_keep = 0, n_prompt_tokens = 88
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 88, n_tokens = 88, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 88, n_tokens = 88
/home/seg/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
dsv3b.sh: line 8: 68577 Aborted (core dumped) ~/llama.cpp/build/bin/llama-server -ngl 62 --host 0.0.0.0 --path ~/llama.cpp/examples/server/public -m /llmzoo/models/DeepSeek-V3-0324-UD-Q3_K_XL.gguf --port 8089 --override-tensor "blk.([0-4]).ffn_(up|down)exp.=CUDA0,blk.([1][0257]|[5]).ffn(up|down)exp.=CUDA1,blk.([2][0257]|[6]).ffn(up|down)exp.=CUDA2,blk.([3][0257]|[7]).ffn(up|down)exp.=CUDA3,blk.([4][0257]|[6][01]).ffn(up|down)exp.=CUDA4,blk.([5][02579]|[6][2]).ffn(up|down)exp.=CUDA5,blk.([8-9]|[1-9][0-9]).ffn.exp.=CPU" -md ~/models/draft/DeepSeek-V3-0324-DRAFT-0.5B-Q8_0.gguf -ngld 127 -devd CUDA2 -cd 16000 -fa -mg 4 --no-mmap -c 16000
I can confirm the same behavior on macOS.
Version: llama-b5353-bin-macos-arm64.zip
macOS: 15.4.1 (24E263)
Error:
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 1
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 2, front = 0
slot update_slots: id 0 | task 0 | kv cache rm [347, end)
srv process_chun: processing image...
image/slice encoded in 21169 ms
decoding image batch 1/1, n_tokens_batch = 256
set_causal_attn: value = 0
image decoded (batch 1/1) in 6587 ms
set_causal_attn: value = 1
srv process_chun: image processed in 27757 ms
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 609, n_tokens = 6, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 609, n_tokens = 6
srv update_slots: decoding batch, n_tokens = 6
set_embeddings: value = 0
clear_adapter_lora: call
/Users/runner/work/llama.cpp/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
zsh: abort /Users/myuserdir/Projects/llamacpp/bin/llama-server --model --mmproj 4096
command line to start llama-server:
/Users/myuserdir/Projects/llamacpp/bin/llama-server
--model /Users/myuserdir/Projects/ImageIndexer/resources/Qwen2-VL-2B-Instruct-Q6_K.gguf
--mmproj /Users/myuserdir/Projects/ImageIndexer/resources/mmproj-Qwen2-VL-2B-Instruct-f16.gguf
--ctx-size 4096
-v
JSON payload:
request: {
"max_tokens": 250,
"messages": [
{
"content": "You describe the image and generate keywords.",
"role": "system"
},
{
"content": [
{
"text": "The tasks are to describe the image and to come up with a large set of keyword tags for it.\n\nWrite the Description using the active voice.\n\nThe Keywords must be one or two words each. Generate as many Keywords as possible using a controlled and consistent vocabulary.\n\nFor both Description and Keywords, make sure to include:\n\n - Themes, concepts\n - Items, animals, objects\n - Structures, landmarks, setting\n - Foreground and background elements\n - Notable colors, textures, styles\n - Actions, activities\n\nIf humans are present, include: \n - Physical appearance\n - Gender\n - Clothing\n - Age range\n - Visibly apparent ancestry\n - Occupation/role\n - Relationships between individuals\n - Emotions, expressions, body language\n\nUse ENGLISH only. Generate ONLY a JSON object with the keys Description and Keywords as follows {"Description": str, "Keywords": []}\n<EXAMPLE>\nThe example input would be a stock photo of two apples, one red and one green, against a white backdrop and is a hypothetical Description and Keyword for a non-existent image.\nOUTPUT=json{\"Description\": \"Two apples next to each other, one green and one red, placed side by side against a white background. There is even and diffuse studio lighting. The fruit is glossy and covered with dropplets of water indicating they are fresh and recently washed. The image emphasizes the cleanliness and appetizing nature of the food\", \"Keywords\": [\"studio shot\",\"green\",\"fruit\",\"red\",\"apple\",\"stock image\",\"health food\",\"appetizing\",\"empty background\",\"grocery\",\"food\",\"snack\"]}\n</EXAMPLE> ",
"type": "text"
},
{
"image_url": {
"url": "data:image/jpeg;base64,...image content in base64 here..."
},
"type": "image_url"
}
],
"role": "user"
}
],
"min_p": 1.05,
"temperature": 0.1,
"top_k": 0,
"top_p": 1
}
The bug happens in the top-p sampler; I managed to get a debugging session going:
Now finish your task according to taskDefinition, only write the poem, add no commentary.
assistant
GGGGGGGGG/home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed
Thread 1 "llama-cli" received signal SIGABRT, Aborted.
0x00007ffff44a9eec in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007ffff44a9eec in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff445afb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff4445472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ffff75b1741 in ggml_abort (file=0x7ffff7d40c98 "/home/bjorn/vc/llama.cpp/src/llama-sampling.cpp", line=204, fmt=0x7ffff7d40c81 "GGML_ASSERT(%s) failed") at /home/bjorn/vc/llama.cpp/ggml/src/ggml.c:216
#4 0x00007ffff7cbae1a in llama_sampler_softmax_impl (cur_p=0x55555b155a70) at /home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:204
#5 0x00007ffff7cbc5f0 in llama_sampler_top_p_apply (smpl=0x55555b14a940, cur_p=0x55555b155a70) at /home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:715
#6 0x00007ffff7cbb8c5 in llama_sampler_apply (smpl=0x55555b14a940, cur_p=0x55555b155a70) at /home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:343
#7 0x00007ffff7cbbd33 in llama_sampler_chain_apply (smpl=0x55555b142d40, cur_p=0x55555b155a70) at /home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:436
#8 0x00007ffff7cbb8c5 in llama_sampler_apply (smpl=0x55555b142d40, cur_p=0x55555b155a70) at /home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:343
#9 0x00005555557cb1fb in common_sampler_sample (gsmpl=0x55555b1558f0, ctx=0x5555577f7260, idx=-1, grammar_first=false) at /home/bjorn/vc/llama.cpp/common/sampling.cpp:349
#10 0x00005555555e688c in main (argc=29, argv=0x7fffffffd728) at /home/bjorn/vc/llama.cpp/tools/main/main.cpp:699
(gdb) f 4
#4 0x00007ffff7cbae1a in llama_sampler_softmax_impl (cur_p=0x55555b155a70) at /home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:204
204 GGML_ASSERT(cur_p->size > 0);
(gdb) l
199 cur_p->data[i].logit /= temp;
200 }
201 }
202
203 static void llama_sampler_softmax_impl(llama_token_data_array * cur_p) {
204 GGML_ASSERT(cur_p->size > 0);
205
206 // Sort the logits in descending order
207 if (!cur_p->sorted) {
208 std::sort(cur_p->data, cur_p->data + cur_p->size, [](const llama_token_data & a, const llama_token_data & b) {
(gdb) f 5
#5 0x00007ffff7cbc5f0 in llama_sampler_top_p_apply (smpl=0x55555b14a940, cur_p=0x55555b155a70) at /home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:715
715 llama_sampler_softmax_impl(cur_p);
(gdb) l
710
711 if (ctx->p >= 1.0f) {
712 return;
713 }
714
715 llama_sampler_softmax_impl(cur_p);
716
717 // Compute the cumulative probabilities
718 float cum_sum = 0.0f;
719 size_t last_idx = cur_p->size;
(gdb) p ctx
$1 = (const llama_sampler_top_p *) 0x55555b145b60
(gdb) p ctx->p
$2 = 0.949999988
(gdb) p *ctx
$3 = {p = 0.949999988, min_keep = 0}
(gdb) p cur_p
$4 = (llama_token_data_array *) 0x55555b155a70
(gdb) p *cur_p
$5 = {data = 0x55555c0c88d0, size = 0, selected = -1, sorted = false}
(gdb) f 6
#6 0x00007ffff7cbb8c5 in llama_sampler_apply (smpl=0x55555b14a940, cur_p=0x55555b155a70) at /home/bjorn/vc/llama.cpp/src/llama-sampling.cpp:343
343 smpl->iface->apply(smpl, cur_p);
(gdb) l
338 }
339 }
340
341 void llama_sampler_apply(struct llama_sampler * smpl, struct llama_token_data_array * cur_p) {
342 GGML_ASSERT(smpl->iface->apply);
343 smpl->iface->apply(smpl, cur_p);
344 }
345
346 void llama_sampler_reset(struct llama_sampler * smpl) {
347 if (smpl->iface->reset) {
(gdb) p *smpl
$6 = {iface = 0x7ffff7e3ec00 <llama_sampler_top_p_i>, ctx = 0x55555b145b60}
(gdb) p *smpl->iface
$7 = {name = 0x7ffff7cbc59e <llama_sampler_top_p_name(llama_sampler const*)>, accept = 0x0, apply = 0x7ffff7cbc5af <llama_sampler_top_p_apply(llama_sampler*, llama_token_data_array*)>, reset = 0x0, clone = 0x7ffff7cbc696 <llama_sampler_top_p_clone(llama_sampler const*)>,
free = 0x7ffff7cbc6ca <llama_sampler_top_p_free(llama_sampler*)>}
(gdb) p *smpl->iface->apply
$8 = {void (llama_sampler *, llama_token_data_array *)} 0x7ffff7cbc5af <llama_sampler_top_p_apply(llama_sampler*, llama_token_data_array*)>
(gdb) p *smpl->iface->name
$9 = {const char *(const llama_sampler *)} 0x7ffff7cbc59e <llama_sampler_top_p_name(llama_sampler const*)>
(gdb) p smpl->iface->name(smpl)
$10 = 0x7ffff7d40d6d "top-p"
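So by the time top-p runs, the candidate array is already empty (size = 0, selected = -1), and top-p then calls llama_sampler_softmax_impl, which trips the assert. I don't know yet which sampler in my chain empties the list, but to make the failure mode concrete, here is a trimmed-down illustration with simplified stand-ins (not the real llama.cpp types): a filter that enforces no minimum number of surviving candidates can hand an empty array to the next sampler.

#include <cassert>
#include <utility>
#include <vector>

// Simplified stand-ins for illustration; not the real llama.cpp types.
struct token_data { int id; float logit; float p; };
struct token_data_array { std::vector<token_data> data; };

// Analogue of llama_sampler_softmax_impl: with zero candidates there is
// nothing to sort or normalize, hence the hard assert in the real code.
void softmax_impl(token_data_array & cur) {
    assert(!cur.data.empty()); // corresponds to GGML_ASSERT(cur_p->size > 0)
    // ... sort by logit and normalize probabilities ...
}

// A filter that keeps no minimum number of candidates can leave the array
// empty before the next sampler in the chain gets to run.
void aggressive_filter(token_data_array & cur, float threshold) {
    std::vector<token_data> kept;
    for (const auto & t : cur.data) {
        if (t.p >= threshold) {
            kept.push_back(t);
        }
    }
    cur.data = std::move(kept); // may be empty if the threshold is too strict
}

int main() {
    token_data_array cur;
    cur.data = {{0, 2.0f, 0.6f}, {1, 1.0f, 0.3f}, {2, 0.5f, 0.1f}};
    aggressive_filter(cur, 1.5f); // removes every candidate
    softmax_impl(cur);            // aborts, mirroring the reported crash
    return 0;
}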
It's getting late, but this was with a fresh debug build of llama.cpp from today (6f180b91). Steps to reproduce:
$ git clone --branch bartowski-qwen3-14b https://github.com/bjodah/bug-reproducer-llamacpp-assert-triggering
$ cd bug-reproducer-llamacpp-assert-triggering
$ PATH=/build/llama.cpp-debug/bin:$PATH ./reproducer-llamacpp.sh
Can you check if https://github.com/ggml-org/llama.cpp/pull/13822 fixes the issues?
Btw, top-p = 1 and min-p = 1.05 don't make much sense.
Thank you @ggerganov for taking a look!
I am probably misunderstanding the samplers:
--samplers "top_k;dry;min_p;temperature;top_p" \
--min-p 0.005 \
--top-p 0.97 \
Is the issue that I have put top_p after temperature?
@bjodah These parameters are fine - they didn't work because of the bug that is fixed with #13822. I was referring to the parameters that @michmill1970 used - these are equivalent to the much simpler top-k = 1 greedy sampling.
Got it, thanks!
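In case it helps anyone else who lands here: as I read the standard min-p rule, it keeps tokens whose probability is at least p times the top token's probability, so with min-p = 1.05 no token can pass the check and only a keep-at-least-one guard leaves the single most likely token, i.e. greedy sampling. A small sketch of that rule (illustrative only, with an assumed min_keep parameter; not the llama.cpp implementation):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

// Illustrative min-p filter (standard min-p definition, not the llama.cpp code):
// keep tokens whose probability is at least p * max_probability, but never
// fewer than min_keep tokens.
std::vector<float> min_p_filter(std::vector<float> probs, float p, size_t min_keep) {
    std::sort(probs.begin(), probs.end(), std::greater<float>());
    const float threshold = p * probs.front();
    size_t keep = 0;
    while (keep < probs.size() && probs[keep] >= threshold) {
        ++keep;
    }
    if (keep < min_keep) {
        keep = min_keep; // guard that prevents an empty candidate list
    }
    probs.resize(std::min(keep, probs.size()));
    return probs;
}

int main() {
    const std::vector<float> probs = {0.5f, 0.3f, 0.2f};

    // p = 1.05: threshold = 0.525 > 0.5, so no token passes the check and only
    // the min_keep guard keeps the top token -> effectively top-k = 1 / greedy.
    std::printf("min_keep=1 -> kept %zu candidate(s)\n", min_p_filter(probs, 1.05f, 1).size());

    // with no such guard the list ends up empty, which is exactly the state
    // that GGML_ASSERT(cur_p->size > 0) at llama-sampling.cpp:204 catches
    std::printf("min_keep=0 -> kept %zu candidate(s)\n", min_p_filter(probs, 1.05f, 0).size());
    return 0;
}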