
Constrained decoding with grammar fails for c4ai-command-r-v01

Open CE0110 opened this issue 1 year ago • 5 comments

I am trying to apply constrained decoding for the recently added command-r model.

Using the most recent master branch (https://github.com/ggerganov/llama.cpp/commit/c47cf414efafb8f60596edc7edb5a2d68065e992) I'm trying to apply the simplest list grammar.

./main -m ~/data/c4ai-command-r-v01/ggml-model-Q4_K_M.gguf -p "<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Please give me a list of things to do in SF?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>" -ctk q8_0 -ngl 99 -n 500 --grammar-file grammars/list.gbnf

It fails with

libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found

Any idea what could go wrong here?
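
For context, the grammars/list.gbnf referenced above constrains the output to a simple bulleted list. At the time it looked approximately like this (paraphrased from the repo, so treat the exact character classes as approximate):

root ::= item+

# Excludes various line break characters
item ::= "- " [^\r\n\x0b\x0c\x85\u2028\u2029]+ "\n"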

More details:

Log start
main: build = 2447 (c47cf414)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.3.0
main: seed  = 1710686911
llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from ~/data/c4ai-command-r-v01/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.name str              = c4ai-command-r-v01
llama_model_loader: - kv   2:                      command-r.block_count u32              = 40
llama_model_loader: - kv   3:                   command-r.context_length u32              = 8192
llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 8192
llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 22528
llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 64
llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 64
llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 8000000.000000
llama_model_loader: - kv   9:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                      command-r.logit_scale f32              = 0.062500
llama_model_loader: - kv  12:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   41 tensors
llama_model_loader: - type q4_K:  240 tensors
llama_model_loader: - type q6_K:   41 tensors
llm_load_vocab: special tokens definition check successful ( 1008/256000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = command-r
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 253333
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 8192
llm_load_print_meta: n_embd_v_gqa     = 8192
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 6.2e-02
llm_load_print_meta: n_ff             = 22528
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = none
llm_load_print_meta: freq_base_train  = 8000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 35B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 34.98 B
llm_load_print_meta: model size       = 20.04 GiB (4.92 BPW) 
llm_load_print_meta: general.name     = c4ai-command-r-v01
llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: PAD token        = 0 '<PAD>'
llm_load_print_meta: LF token         = 136 'Ä'
llm_load_tensors: ggml ctx size =    0.25 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 20519.42 MiB, (20519.48 / 147456.00)
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      Metal buffer size = 20519.41 MiB
llm_load_tensors:        CPU buffer size =  1640.62 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 8000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '[...]src/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 154618.82 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   490.00 MiB, (21011.30 / 147456.00)
llama_kv_cache_init:      Metal KV buffer size =   490.00 MiB
llama_new_context_with_model: KV self size  =  490.00 MiB, K (q8_0):  170.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:        CPU  output buffer size =   500.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   516.00 MiB, (21527.30 / 147456.00)
llama_new_context_with_model:      Metal compute buffer size =   516.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    17.00 MiB
llama_new_context_with_model: graph splits: 2

system_info: n_threads = 16 / 24 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 500, n_keep = 1


<|START_OF_TURN_TOKEN|><|USER_TOKEN|>Please give me a list of things to do in SF?<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found

CE0110 avatar Mar 17 '24 14:03 CE0110

Can confirm, same issue

ExtReMLapin avatar Mar 20 '24 09:03 ExtReMLapin

Can confirm for the server too.

curl -Ss --data '{"n_predict":32, "prompt":"Bob: Hi, Alice!\n", "grammar":"root ::= (\"Bob\" | \"Alice\") \":\""}' http://127.0.0.1:8080/completion
{"tid":"140147849643840","timestamp":1713292234,"level":"INFO","function":"launch_slot_with_task","line":1037,"msg":"slot is processing task","id_slot":0,"id_task":0}
{"tid":"140147849643840","timestamp":1713292234,"level":"INFO","function":"update_slots","line":2066,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":0,"p0":0}
terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at
fish: Job 1, '~/test/llama.cpp/server -m /opt…' terminated by signal SIGABRT (Abort)

In my case the crash happens because the model returns token 264, and the vocabulary returns an empty string for this token. This empty string then gets UTF-8-decoded, but the decoding function llama_decode_text does not account for empty strings and crashes. In particular, unicode_cpts_from_utf8 strangely returns a non-empty result for an empty string, which probably causes the bug down the line.

Don't really know much about the Command-R vocab or UTF-8 decoding in llama.cpp, so I'm not really sure what exactly needs to be fixed here: the vocab returning an empty string, unicode_cpts_from_utf8 returning a non-empty result for an empty string, or llama_decode_text trying to decode an empty string at all. Or maybe the issue is even somewhere higher up the chain since, for some reason, it happens only when the grammar is set.
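
To make the failure mode concrete, here is a simplified standalone sketch of that path (not the real llama.cpp functions: the actual lookup goes through unicode_utf8_to_byte() and is keyed by the codepoint's UTF-8 string rather than by the codepoint itself):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for llama_decode_text(): map each codepoint of a
// byte-level-BPE token text back to the raw byte it encodes. The .at()
// call is what terminates the process when a key is missing, producing
// the "unordered_map::at: key not found" message seen above.
static std::string decode_text_sketch(
        const std::vector<uint32_t>                 & cpts, // as from unicode_cpts_from_utf8()
        const std::unordered_map<uint32_t, uint8_t> & map)  // cf. unicode_utf8_to_byte_map()
{
    std::string decoded;
    for (const uint32_t cpt : cpts) {
        decoded += (char) map.at(cpt); // throws std::out_of_range for an unmapped cpt
    }
    return decoded;
}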

z80maniac avatar Apr 16 '24 18:04 z80maniac

This empty string then gets UTF8-decoded, but the decoding function llama_decode_text does not account for empty strings and crashes.

I suppose llama_decode_text should return an empty string on empty input - does that work?

ggerganov avatar Apr 16 '24 19:04 ggerganov

The debugger fooled me. It's not actually an empty string, it's a sequence of 3 bytes: e2 80 8d, the UTF-8 encoding of U+200D (ZERO WIDTH JOINER). And it seems like the map created by unicode_utf8_to_byte_map() does not contain this sequence.
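
For background: a GPT-2 style byte-level BPE maps each of the 256 possible byte values to a printable codepoint, so the reverse map built by unicode_utf8_to_byte_map() has exactly 256 entries. A codepoint like U+200D, which this vocab apparently emits directly (it is the zero-width joiner used inside emoji sequences), was never produced by that byte mapping, so the lookup finds nothing. A standalone sketch of that arithmetic (not llama.cpp code; the ranges are the stock GPT-2 bytes_to_unicode ones):

#include <cstdint>
#include <cstdio>
#include <unordered_map>

int main() {
    // GPT-2 bytes_to_unicode: printable bytes map to themselves, all
    // other byte values are shifted into the range starting at U+0100.
    std::unordered_map<uint32_t, uint8_t> cpt_to_byte;
    uint32_t n = 0;
    for (uint32_t b = 0; b < 256; ++b) {
        const bool printable =
            (b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
        cpt_to_byte[printable ? b : 0x100 + n++] = (uint8_t) b;
    }
    printf("map entries : %zu\n", cpt_to_byte.size());        // 256
    printf("has U+200D  : %zu\n", cpt_to_byte.count(0x200D)); // 0
    // cpt_to_byte.at(0x200D); // would throw std::out_of_range, as in the crash
    return 0;
}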

z80maniac avatar Apr 16 '24 19:04 z80maniac

If you ignore the error with a try...catch, the crash is gone and the output seems valid, though this does not seem like a proper solution.

diff --git a/llama.cpp b/llama.cpp
index f4f4063c..e278680c 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -16694,7 +16694,11 @@ static std::string llama_decode_text(const std::string & text) {
     std::string decoded_text;
     auto unicode_sequences = unicode_cpts_from_utf8(text);
     for (auto & unicode_sequence : unicode_sequences) {
-        decoded_text += unicode_utf8_to_byte(unicode_cpt_to_utf8(unicode_sequence));
+        try {
+            decoded_text += unicode_utf8_to_byte(unicode_cpt_to_utf8(unicode_sequence));
+        } catch (const std::out_of_range & e) {
+            LLAMA_LOG_WARN("%s: while decoding Unicode sequence 0x%x: %s (std::out_of_range)\n", __func__, unicode_sequence, e.what());
+        }
     }

     return decoded_text;

The error happens for a lot of codepoints (4209 in total):

llama_decode_text: while decoding Unicode sequence 0x200d: unordered_map::at (std::out_of_range)
llama_decode_text: while decoding Unicode sequence 0x203c: unordered_map::at (std::out_of_range)
llama_decode_text: while decoding Unicode sequence 0x2049: unordered_map::at (std::out_of_range)
[...]
llama_decode_text: while decoding Unicode sequence 0xe0074: unordered_map::at (std::out_of_range)
llama_decode_text: while decoding Unicode sequence 0xe0077: unordered_map::at (std::out_of_range)
llama_decode_text: while decoding Unicode sequence 0xe007f: unordered_map::at (std::out_of_range)
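
An exception-free alternative would be to probe the map with find() and pass unmapped codepoints through verbatim instead of catching the throw. A sketch (untested; unicode_utf8_to_byte_or_self() is a hypothetical helper, not an existing llama.cpp function):

// Hypothetical replacement for the unicode_utf8_to_byte() call inside
// llama_decode_text(): when the codepoint's UTF-8 form has no entry in
// the 256-entry byte map (e.g. U+200D), keep its bytes unchanged.
static std::string unicode_utf8_to_byte_or_self(const std::string & utf8) {
    static const auto map = unicode_utf8_to_byte_map();
    const auto it = map.find(utf8);
    if (it == map.end()) {
        return utf8; // not byte-encoded: pass through as-is
    }
    return std::string(1, (char) it->second);
}

That way tokens containing such codepoints would round-trip instead of being silently dropped from the output.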

z80maniac avatar Apr 17 '24 18:04 z80maniac

Same issue with aya-23-8B

ExtReMLapin avatar May 24 '24 10:05 ExtReMLapin