llama-cli misbehaving (has the default behavior changed?)
I have a Colab notebook here to quantize and test the models: https://colab.research.google.com/drive/1TcyGL60GQzsxEHu-Xlos5u8bb_6SxMa3
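For context, the GGUF files used below are produced in that notebook. A rough, hypothetical sketch of the quantization step, assuming the standard llama-quantize tool and a placeholder f16 input file (the exact commands live in the notebook):
# hypothetical example: quantize an f16 GGUF down to Q8_0 (file names are placeholders)
./build/bin/llama-quantize gemma-2-Ifable-9B.f16.gguf gemma-2-Ifable-9B.q8_0.gguf Q8_0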
The simple test has always been this:
prompt="""
Tell me the difference between thinking in humans and in LLMs.
"""
m=f'{model_name}.{q_type}.gguf'
!./build/bin/llama-cli --ignore-eos -c 4096 -m /content/$m -t $(nproc) -ngl 999 -p "User: Hi\nBot:Hi\nUser: {prompt}\nBot:"
Usually, after initialization, the models start answering (and then even continue on their own, which is fine).
Now (b4762) it does this instead:
build: 4762 (af7747c9) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 39 key-value pairs and 464 tensors from /content/gemma-2-Ifable-9B.q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 2 Ifable 9B
llama_model_loader: - kv 3: general.organization str = Ifable
llama_model_loader: - kv 4: general.basename str = gemma-2-Ifable
llama_model_loader: - kv 5: general.size_label str = 9B
llama_model_loader: - kv 6: general.license str = gemma
llama_model_loader: - kv 7: general.dataset.count u32 = 1
llama_model_loader: - kv 8: general.dataset.0.name str = Gutenberg Dpo v0.1
llama_model_loader: - kv 9: general.dataset.0.version str = v0.1
llama_model_loader: - kv 10: general.dataset.0.organization str = Jondurbin
llama_model_loader: - kv 11: general.dataset.0.repo_url str = https://huggingface.co/jondurbin/gute...
llama_model_loader: - kv 12: gemma2.context_length u32 = 8192
llama_model_loader: - kv 13: gemma2.embedding_length u32 = 3584
llama_model_loader: - kv 14: gemma2.block_count u32 = 42
llama_model_loader: - kv 15: gemma2.feed_forward_length u32 = 14336
llama_model_loader: - kv 16: gemma2.attention.head_count u32 = 16
llama_model_loader: - kv 17: gemma2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: gemma2.attention.key_length u32 = 256
llama_model_loader: - kv 20: gemma2.attention.value_length u32 = 256
llama_model_loader: - kv 21: gemma2.attn_logit_softcapping f32 = 50.000000
llama_model_loader: - kv 22: gemma2.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 23: gemma2.attention.sliding_window u32 = 4096
llama_model_loader: - kv 24: tokenizer.ggml.model str = llama
llama_model_loader: - kv 25: tokenizer.ggml.pre str = default
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 27: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {{ '<bos>' }}{% if messages[0]['role'...
llama_model_loader: - kv 36: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 7
llama_model_loader: - type f32: 169 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q8_0: 294 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 9.95 GiB (9.25 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 217
load: token to piece cache size = 1.6014 MB
print_info: arch = gemma2
print_info: vocab_only = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 3584
print_info: n_layer = 42
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 4096
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 9B
print_info: model params = 9.24 B
print_info: general.name = Gemma 2 Ifable 9B
print_info: vocab type = SPM
print_info: n_vocab = 256000
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 1 '<eos>'
print_info: EOT token = 107 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 227 '<0x0A>'
print_info: EOG token = 1 '<eos>'
print_info: EOG token = 107 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 42 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 43/43 layers to GPU
load_tensors: CPU_Mapped model buffer size = 10186.44 MiB
....................................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 42, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 1344.00 MiB
llama_init_from_model: KV self size = 1344.00 MiB, K (f16): 672.00 MiB, V (f16): 672.00 MiB
llama_init_from_model: CPU output buffer size = 0.98 MiB
llama_init_from_model: CPU compute buffer size = 514.00 MiB
llama_init_from_model: graph nodes = 1690
llama_init_from_model: graph splits = 1
common_init_from_params: added <eos> logit bias = -inf
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
system_info: n_threads = 2 (n_threads_batch = 2) / 2 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 3895428166
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
>
Am I doing something wrong?
Note: if I use b4000, everything works as usual.
For reference, this is the output I get with the same model using b4000:
build: 4000 (c02e5ab2) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 39 key-value pairs and 464 tensors from /content/gemma-2-Ifable-9B.q8q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 2 Ifable 9B
llama_model_loader: - kv 3: general.organization str = Ifable
llama_model_loader: - kv 4: general.basename str = gemma-2-Ifable
llama_model_loader: - kv 5: general.size_label str = 9B
llama_model_loader: - kv 6: general.license str = gemma
llama_model_loader: - kv 7: general.dataset.count u32 = 1
llama_model_loader: - kv 8: general.dataset.0.name str = Gutenberg Dpo v0.1
llama_model_loader: - kv 9: general.dataset.0.version str = v0.1
llama_model_loader: - kv 10: general.dataset.0.organization str = Jondurbin
llama_model_loader: - kv 11: general.dataset.0.repo_url str = https://huggingface.co/jondurbin/gute...
llama_model_loader: - kv 12: gemma2.context_length u32 = 8192
llama_model_loader: - kv 13: gemma2.embedding_length u32 = 3584
llama_model_loader: - kv 14: gemma2.block_count u32 = 42
llama_model_loader: - kv 15: gemma2.feed_forward_length u32 = 14336
llama_model_loader: - kv 16: gemma2.attention.head_count u32 = 16
llama_model_loader: - kv 17: gemma2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 18: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: gemma2.attention.key_length u32 = 256
llama_model_loader: - kv 20: gemma2.attention.value_length u32 = 256
llama_model_loader: - kv 21: gemma2.attn_logit_softcapping f32 = 50.000000
llama_model_loader: - kv 22: gemma2.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 23: gemma2.attention.sliding_window u32 = 4096
llama_model_loader: - kv 24: tokenizer.ggml.model str = llama
llama_model_loader: - kv 25: tokenizer.ggml.pre str = default
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 27: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {{ '<bos>' }}{% if messages[0]['role'...
llama_model_loader: - kv 36: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 15
llama_model_loader: - type f32: 169 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type q4_K: 252 tensors
llama_model_loader: - type q6_K: 42 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 217
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma2
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 3584
llm_load_print_meta: n_layer = 42
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 256
llm_load_print_meta: n_swa = 4096
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 9B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 9.24 B
llm_load_print_meta: model size = 5.57 GiB (5.17 BPW)
llm_load_print_meta: general.name = Gemma 2 Ifable 9B
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: EOT token = 107 '<end_of_turn>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 227 '<0x0A>'
llm_load_print_meta: EOG token = 1 '<eos>'
llm_load_print_meta: EOG token = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 42 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 5700.31 MiB
.....................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1344.00 MiB
llama_new_context_with_model: KV self size = 1344.00 MiB, K (f16): 672.00 MiB, V (f16): 672.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.98 MiB
llama_new_context_with_model: CPU compute buffer size = 514.00 MiB
llama_new_context_with_model: graph nodes = 1690
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
system_info: n_threads = 2 (n_threads_batch = 2) / 2 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 172490367
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
User: Hi
Bot:Hi
User:
Tell me the difference between thinking in humans and in LLMs.
Bot:Here's a breakdown of the key differences between human thinking and how Large Language Models (LLMs) like me "think":
**Human Thinking:**
* **Biological & Subconscious:** Rooted in complex neural networks in the brain, much of human thought is subconscious, emergent, and influenced by emotions, experiences, and bodily sensations.
* **Intuitive & Creative:** Humans excel at making leaps of logic,
llama_perf_sampler_print: sampling time = 36.64 ms / 116 runs ( 0.32 ms per token, 3166.11 tokens per second)
llama_perf_context_print: load time = 30829.81 ms
llama_perf_context_print: prompt eval time = 14709.99 ms / 29 tokens ( 507.24 ms per token, 1.97 tokens per second)
llama_perf_context_print: eval time = 70962.65 ms / 86 runs ( 825.15 ms per token, 1.21 tokens per second)
llama_perf_context_print: total time = 85905.09 ms / 115 tokens
Interrupted by user
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
wget https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf -O gemma-7b-q4.gguf
./build/bin/llama-cli -m gemma-7b-q4.gguf -c 1024 -p "Once upon a time"
Output:
main: llama threadpool init, n_threads = 6
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 3615269160
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 1024
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 1024, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
>
Is the documentation outdated?
https://github.com/ggml-org/llama.cpp/blob/master/examples/main/README.md
Yes. Ever since 84a4481, conversation mode has been the default; you now have to specify -no-cnv to get the old behavior.
Is it -no-cnv or --no-cnv?
It's -no-cnv or --no-conversation.
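So the fix should just be adding that flag back to the invocations used above; an untested sketch, reusing the commands from earlier in the thread:
# the Colab test command from the top of the thread, with conversation mode disabled
!./build/bin/llama-cli -no-cnv --ignore-eos -c 4096 -m /content/$m -t $(nproc) -ngl 999 -p "User: Hi\nBot:Hi\nUser: {prompt}\nBot:"
# long form of the same flag, applied to the second reproduction
./build/bin/llama-cli --no-conversation -m gemma-7b-q4.gguf -c 1024 -p "Once upon a time"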