EOT token incorrectly set for Mistral-v0.2 trained with added ChatML tokens

Open xzuyn opened this issue 1 year ago • 2 comments

llama.cpp sets the EOT token to 32000 and reports that 32000 is <|im_end|>, but that is not the case for my model. My tokenizer_config.json shows that 32000 is <|im_start|>, which is how I trained it. This also seems to cause my model to end responses with <|im_start|> instead of <|im_end|>.

I converted and then quantized the model with:

python3 convert-hf-to-gguf.py --outtype bf16 --outfile "./ggml-model-bf16.gguf" "./MyModelDir"
quantize "./ggml-model-bf16.gguf" "./MyModel-q6_K.gguf" "q6_K"

Link to the model's QLoRA if it matters.

llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32000 '<|im_end|>'

tokenizer_config.json:
    "32000": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "32001": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }

It's like it's hardcoded to set <|im_start|> to 32001 and <|im_end|> to 32000 even if that's not what the model uses.
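One way to double-check which ID each ChatML token has on the Hugging Face side is to read the mapping straight from tokenizer_config.json. This is a minimal sketch, assuming the entries shown above sit under the `added_tokens_decoder` key; the inline JSON stands in for the real file, and `added_token_ids` is a hypothetical helper name, not part of llama.cpp or Transformers.

```python
import json

# Inline stand-in for tokenizer_config.json, trimmed to the fields used here.
# Assumption: the entries shown above live under "added_tokens_decoder".
CONFIG_TEXT = """
{
  "added_tokens_decoder": {
    "32000": {"content": "<|im_start|>", "special": false},
    "32001": {"content": "<|im_end|>", "special": false}
  }
}
"""


def added_token_ids(config: dict) -> dict:
    """Map token content -> (id, special flag) from added_tokens_decoder."""
    decoder = config.get("added_tokens_decoder", {})
    return {
        entry["content"]: (int(token_id), entry.get("special", False))
        for token_id, entry in decoder.items()
    }


tokens = added_token_ids(json.loads(CONFIG_TEXT))
# Print the tokens in ID order so they can be compared with the
# llm_load_print_meta lines by eye.
for content, (token_id, special) in sorted(tokens.items(), key=lambda kv: kv[1][0]):
    print(f"{token_id}: {content!r} special={special}")
```

If the IDs printed here disagree with what llm_load_print_meta reports, the mismatch was introduced during conversion rather than in the tokenizer files.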

xzuyn, May 14 '24 01:05

It appears your model does not list <|im_start|> or <|im_end|> as special tokens. llama.cpp has logic that applies when a token is not marked special.

If you're able, try setting "special" to true for those tokens.
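A minimal sketch of that change, assuming the tokens are stored under `added_tokens_decoder` in tokenizer_config.json. `mark_special` is a hypothetical helper, the dictionary below stands in for the real file, and nothing in this thread confirms that flipping the flag changes which ID the converter picks for EOT; the model would have to be reconverted to find out.

```python
import json

CHATML_TOKENS = {"<|im_start|>", "<|im_end|>"}


def mark_special(config: dict, contents: set) -> dict:
    """Return a copy of config with "special" set to True for the given tokens."""
    updated = json.loads(json.dumps(config))  # deep copy via a JSON round trip
    for entry in updated.get("added_tokens_decoder", {}).values():
        if entry.get("content") in contents:
            entry["special"] = True
    return updated


# Stand-in for the parsed tokenizer_config.json (assumed layout).
config = {
    "added_tokens_decoder": {
        "32000": {"content": "<|im_start|>", "special": False},
        "32001": {"content": "<|im_end|>", "special": False},
    }
}

patched = mark_special(config, CHATML_TOKENS)
print(json.dumps(patched, indent=2))
```

The helper returns a copy instead of editing in place, so the original configuration stays available for comparison.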

Jeximo, May 14 '24 20:05

That wouldn't explain this though.

> It's like it's hardcoded to set <|im_start|> to 32001 and <|im_end|> to 32000 even if that's not what the model uses.

[Screenshot: Screenshot_from_2024-05-13_21-21-54]

My model also was not trained with those tokens set as special, so I shouldn't need to change that to get things to work.


Also, I think the issue you linked is similar or related to an issue I opened the other day about models that use legacy: true

xzuyn, May 15 '24 01:05

The link to your model is 404 not found.

Anyway, did you check if added_tokens.json is set correctly? (The JSON you posted above is from tokenizer_config.json)
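A minimal sketch of such a check, assuming added_tokens.json maps token text to ID and tokenizer_config.json keeps its entries under `added_tokens_decoder`. `find_mismatches` is a hypothetical helper, and the inline dictionaries stand in for the two files.

```python
def find_mismatches(added_tokens: dict, config: dict) -> list:
    """Return (token, id in added_tokens.json, id in tokenizer_config.json)
    for every token on which the two files disagree."""
    decoder = config.get("added_tokens_decoder", {})
    config_ids = {entry["content"]: int(token_id) for token_id, entry in decoder.items()}
    return [
        (token, token_id, config_ids.get(token))
        for token, token_id in added_tokens.items()
        if config_ids.get(token) != token_id
    ]


# Stand-ins for the parsed files (assumed layouts).
added_tokens = {"<|im_end|>": 32001, "<|im_start|>": 32000}
config = {
    "added_tokens_decoder": {
        "32000": {"content": "<|im_start|>"},
        "32001": {"content": "<|im_end|>"},
    }
}

mismatches = find_mismatches(added_tokens, config)
print(mismatches or "added_tokens.json and tokenizer_config.json agree")
```

An empty result means the two files agree with each other, which points away from the tokenizer files and toward the conversion step.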

ngxson, May 17 '24 23:05

> The link to your model is 404 not found.

Sorry, I unprivated it.

> Anyway, did you check if added_tokens.json is set correctly? (The JSON you posted above is from tokenizer_config.json)

My model is fine, and added_tokens.json is also set correctly. The issue here is that llama.cpp's conversion does not match Transformers at all when it comes to added tokens. Here is my added_tokens.json, followed by a script that compares the two tokenizers:

{
  "<|im_end|>": 32001,
  "<|im_start|>": 32000
}

from transformers import AutoTokenizer
import requests


string_to_test = "<|im_start|>user\nTest Input<|im_end|><|im_start|>assistant\nTest Response<|im_end|>"

tokenizer = AutoTokenizer.from_pretrained("PJMixers/MV02-PB-Mixture-v1-run_15-SFT-7B-Latest-QLoRA")

# Model is converted and quantized with llama.cpp, running on the latest KoboldCpp
koboldcpp_string_to_test = (
    requests.post(
        "http://127.0.0.1:5001/api/extra/tokencount",
        json={"prompt": string_to_test},
    ).json()["ids"]
)

# Transformers output (Correct)
print(tokenizer.encode(string_to_test))
# [1, 32000, 2188, 13, 1963, 11232, 32001, 32000, 13892, 13, 1963, 12107, 32001]
# ['<s>', '<|im_start|>', '▁user', '<0x0A>', 'Test', '▁Input', '<|im_end|>', '<|im_start|>', '▁assistant', '<0x0A>', 'Test', '▁Response', '<|im_end|>']

# KoboldCPP/llama.cpp output (Very incorrect)
print(koboldcpp_string_to_test)
# [1, 32001, 1838, 13, 1963, 11232, 32000, 32001, 489, 11143, 13, 1963, 12107, 32000]
# ['<s>', '<|im_end|>', 'user', '<0x0A>', 'Test', '▁Input', '<|im_start|>', '<|im_end|>', 'ass', 'isstant', '<0x0A>', 'Test', '▁Response', '<|im_start|>']

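You can also see the legacy: true issue here: Transformers produces '▁user' and '▁assistant', while llama.cpp produces 'user' and splits assistant into 'ass' and 'isstant'.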
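To isolate the added-token problem from the legacy: true differences, the two ID sequences can be reduced to their added-token IDs and compared. A minimal sketch using the IDs reported in the script comments above; `added_only` and `SWAP` are hypothetical names introduced for illustration.

```python
# Token IDs copied from the two outputs in the script comments.
hf_ids = [1, 32000, 2188, 13, 1963, 11232, 32001, 32000, 13892, 13, 1963, 12107, 32001]
lcpp_ids = [1, 32001, 1838, 13, 1963, 11232, 32000, 32001, 489, 11143, 13, 1963, 12107, 32000]

ADDED = {32000, 32001}
SWAP = {32000: 32001, 32001: 32000}


def added_only(ids: list) -> list:
    """Keep only the added ChatML token IDs, preserving their order."""
    return [token_id for token_id in ids if token_id in ADDED]


hf_added = added_only(hf_ids)
lcpp_added = added_only(lcpp_ids)

# If llama.cpp has the two IDs exchanged, swapping every ID on the
# Transformers side should reproduce the llama.cpp sequence exactly.
is_swapped = [SWAP[token_id] for token_id in hf_added] == lcpp_added
print("added tokens swapped:", is_swapped)
```

Filtering first matters because the two sequences have different lengths (13 and 14 IDs), so a position-by-position comparison would mix the swap with the unrelated split of assistant.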

xzuyn, May 18 '24 05:05

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot], Jul 02 '24 01:07