EOT token incorrectly set for Mistral-v0.2 trained with added ChatML tokens
The conversion is setting the EOT token to 32000 and reporting that 32000 is <|im_end|>, but that's not what it is for my model. My tokenizer_config.json shows that 32000 is <|im_start|>, which is how I trained it. This also seems to be causing my model to end responses with <|im_start|> instead of <|im_end|>.
I converted with
python3 convert-hf-to-gguf.py --outtype bf16 --outfile "./ggml-model-bf16.gguf" "./MyModelDir"
then quantized with
quantize "./ggml-model-bf16.gguf" "./MyModel-q6_K.gguf" "q6_K"
Link to the model's QLoRA if it matters.
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32000 '<|im_end|>'
"32000": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"32001": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
}
It's like it's hardcoded to set <|im_start|> to 32001 and <|im_end|> to 32000 even if that's not what the model uses.
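For reference, checking those ids on the Transformers side (a minimal sketch; the repo ID is the QLoRA linked above) shows the mapping the model was trained with:

from transformers import AutoTokenizer

# Tokenizer as trained (same repo as the QLoRA linked above)
tokenizer = AutoTokenizer.from_pretrained("PJMixers/MV02-PB-Mixture-v1-run_15-SFT-7B-Latest-QLoRA")

print(tokenizer.convert_ids_to_tokens([32000, 32001]))
# expected: ['<|im_start|>', '<|im_end|>']
print(tokenizer.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>"]))
# expected: [32000, 32001]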
It appears your model does not list <|im_start|> or <|im_end|> as special tokens ("special" is false for both). There's logic in llama.cpp that handles added tokens differently when they are not marked as special.
If you're able, try setting "special" to true for the necessary tokens.
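Something along these lines should flip the flags in place (a rough sketch, assuming the entries live under added_tokens_decoder in tokenizer_config.json as in your snippet; re-run the conversion afterwards):

import json

# Example path; point this at your model directory's tokenizer_config.json
path = "./MyModelDir/tokenizer_config.json"

with open(path) as f:
    config = json.load(f)

# Mark the ChatML tokens as special in the added_tokens_decoder entries
for entry in config.get("added_tokens_decoder", {}).values():
    if entry.get("content") in ("<|im_start|>", "<|im_end|>"):
        entry["special"] = True

with open(path, "w") as f:
    json.dump(config, f, indent=2)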
That wouldn't explain this though.
> It's like it's hardcoded to set <|im_start|> to 32001 and <|im_end|> to 32000 even if that's not what the model uses.
My model also was not trained with those tokens set as special, so I shouldn't need to change that to get things to work.
The link to your model is 404 not found.
Anyway, did you check if added_tokens.json is set correctly? (The JSON you posted above is from tokenizer_config.json)
> The link to your model is 404 not found.
Sorry, I unprivated it.
> Anyway, did you check if added_tokens.json is set correctly? (The JSON you posted above is from tokenizer_config.json)
My model is fine, and added_tokens.json is also set correctly (cross-checked below). The issue here is that the llama.cpp conversion does not match Transformers at all when it comes to added tokens.
{
"<|im_end|>": 32001,
"<|im_start|>": 32000
}
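A quick cross-check of that file against what Transformers actually loads reports the same mapping (minimal sketch, assuming the local model directory from the conversion command above):

import json
from transformers import AutoTokenizer

# Compare added_tokens.json on disk with the added vocab Transformers reports
with open("./MyModelDir/added_tokens.json") as f:
    print(json.load(f))            # {'<|im_end|>': 32001, '<|im_start|>': 32000}

tokenizer = AutoTokenizer.from_pretrained("./MyModelDir")
print(tokenizer.get_added_vocab()) # same token -> id mapping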
from transformers import AutoTokenizer
import requests
string_to_test = "<|im_start|>user\nTest Input<|im_end|><|im_start|>assistant\nTest Response<|im_end|>"
tokenizer = AutoTokenizer.from_pretrained("PJMixers/MV02-PB-Mixture-v1-run_15-SFT-7B-Latest-QLoRA")
# Model is converted and quantized with lcpp, running on the latest kcpp
koboldcpp_string_to_test = requests.post(
    "http://127.0.0.1:5001/api/extra/tokencount",
    json={"prompt": string_to_test},
).json()["ids"]
# Transformers output (Correct)
print(tokenizer.encode(string_to_test))
# [1, 32000, 2188, 13, 1963, 11232, 32001, 32000, 13892, 13, 1963, 12107, 32001]
# ['<s>', '<|im_start|>', '▁user', '<0x0A>', 'Test', '▁Input', '<|im_end|>', '<|im_start|>', '▁assistant', '<0x0A>', 'Test', '▁Response', '<|im_end|>']
# KoboldCPP/llama.cpp output (Very incorrect)
print(koboldcpp_string_to_test)
# [1, 32001, 1838, 13, 1963, 11232, 32000, 32001, 489, 11143, 13, 1963, 12107, 32000]
# ['<s>', '<|im_end|>', 'user', '<0x0A>', 'Test', '▁Input', '<|im_start|>', '<|im_end|>', 'ass', 'isstant', '<0x0A>', 'Test', '▁Response', '<|im_start|>']
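Continuing that script, mapping the llama.cpp ids back through the training-time vocab makes the swap explicit (pure Transformers, nothing assumed beyond the variables already defined above):

# Map both id lists through the Transformers (training-time) vocab to see
# what the model will actually "read" for each id the GGUF tokenizer produced.
hf_ids = tokenizer.encode(string_to_test)
print(tokenizer.convert_ids_to_tokens(hf_ids))
print(tokenizer.convert_ids_to_tokens(koboldcpp_string_to_test))
# Wherever the prompt had <|im_start|> (id 32000 at training time), llama.cpp
# emitted 32001, which the model was trained to treat as <|im_end|>, and vice versa.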
This issue was closed because it has been inactive for 14 days since being marked as stale.