EOT token incorrectly set for Mistral-v0.2 trained with added ChatML tokens
The conversion is setting the EOT token to 32000 and reporting that 32000 is <|im_end|>, but that's not what it is for my model. My tokenizer_config.json shows that 32000 is <|im_start|>, which is how I trained it. This also seems to be causing my model to end responses with <|im_start|> instead of <|im_end|>.
I converted with
python3 convert-hf-to-gguf.py --outtype bf16 --outfile "./ggml-model-bf16.gguf" "./MyModelDir"
then quantized with
quantize "./ggml-model-bf16.gguf" "./MyModel-q6_K.gguf" "q6_K"
Link to the model's QLoRA if it matters.
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32000 '<|im_end|>'
"32000": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"32001": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
}
It's like it's hardcoded to set <|im_start|> to 32001 and <|im_end|> to 32000 even if that's not what the model uses.
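For reference, checking those ids on the Transformers side (a minimal sketch; the repo ID is the QLoRA linked above) shows the mapping the model was trained with:

from transformers import AutoTokenizer

# Tokenizer as trained (same repo as the QLoRA linked above)
tokenizer = AutoTokenizer.from_pretrained("PJMixers/MV02-PB-Mixture-v1-run_15-SFT-7B-Latest-QLoRA")

print(tokenizer.convert_ids_to_tokens([32000, 32001]))
# expected: ['<|im_start|>', '<|im_end|>']
print(tokenizer.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>"]))
# expected: [32000, 32001]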
It appears your model does not list <|im_start|> or <|im_end|> as special tokens ("special" is false for both). There's logic in llama.cpp that handles added tokens differently when they are not marked as special.
If you're able, try setting "special" to true for the necessary tokens.
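Something along these lines should flip the flags in place (a rough sketch, assuming the entries live under added_tokens_decoder in tokenizer_config.json as in your snippet; re-run the conversion afterwards):

import json

# Example path; point this at your model directory's tokenizer_config.json
path = "./MyModelDir/tokenizer_config.json"

with open(path) as f:
    config = json.load(f)

# Mark the ChatML tokens as special in the added_tokens_decoder entries
for entry in config.get("added_tokens_decoder", {}).values():
    if entry.get("content") in ("<|im_start|>", "<|im_end|>"):
        entry["special"] = True

with open(path, "w") as f:
    json.dump(config, f, indent=2)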
That wouldn't explain this though.
> It's like it's hardcoded to set <|im_start|> to 32001 and <|im_end|> to 32000 even if that's not what the model uses.
My model also was not trained with those tokens set as special, so I shouldn't need to change that to get things to work.
The link to your model is 404 not found.
Anyway, did you check if added_tokens.json is set correctly? (The JSON you posted above is from tokenizer_config.json)
> The link to your model is 404 not found.
Sorry, I unprivated it.
> Anyway, did you check if added_tokens.json is set correctly? (The JSON you posted above is from tokenizer_config.json)
My model is fine, and added_tokens.json is also set correctly (cross-checked below). The issue here is that the llama.cpp conversion does not match Transformers at all when it comes to added tokens.
{
"<|im_end|>": 32001,
"<|im_start|>": 32000
}
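A quick cross-check of that file against what Transformers actually loads reports the same mapping (minimal sketch, assuming the local model directory from the conversion command above):

import json
from transformers import AutoTokenizer

# Compare added_tokens.json on disk with the added vocab Transformers reports
with open("./MyModelDir/added_tokens.json") as f:
    print(json.load(f))            # {'<|im_end|>': 32001, '<|im_start|>': 32000}

tokenizer = AutoTokenizer.from_pretrained("./MyModelDir")
print(tokenizer.get_added_vocab()) # same token -> id mapping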
from transformers import AutoTokenizer
import requests
string_to_test = "<|im_start|>user\nTest Input<|im_end|><|im_start|>assistant\nTest Response<|im_end|>"
tokenizer = AutoTokenizer.from_pretrained("PJMixers/MV02-PB-Mixture-v1-run_15-SFT-7B-Latest-QLoRA")
# Model is converted and quantized with lcpp, running on the latest kcpp
koboldcpp_string_to_test = requests.post(
    "http://127.0.0.1:5001/api/extra/tokencount",
    json={"prompt": string_to_test},
).json()["ids"]
# Transformers output (Correct)
print(tokenizer.encode(string_to_test))
# [1, 32000, 2188, 13, 1963, 11232, 32001, 32000, 13892, 13, 1963, 12107, 32001]
# ['<s>', '<|im_start|>', '▁user', '<0x0A>', 'Test', '▁Input', '<|im_end|>', '<|im_start|>', '▁assistant', '<0x0A>', 'Test', '▁Response', '<|im_end|>']
# KoboldCPP/llama.cpp output (Very incorrect)
print(koboldcpp_string_to_test)
# [1, 32001, 1838, 13, 1963, 11232, 32000, 32001, 489, 11143, 13, 1963, 12107, 32000]
# ['<s>', '<|im_end|>', 'user', '<0x0A>', 'Test', '▁Input', '<|im_start|>', '<|im_end|>', 'ass', 'isstant', '<0x0A>', 'Test', '▁Response', '<|im_start|>']
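Continuing that script, mapping the llama.cpp ids back through the training-time vocab makes the swap explicit (pure Transformers, nothing assumed beyond the variables already defined above):

# Map both id lists through the Transformers (training-time) vocab to see
# what the model will actually "read" for each id the GGUF tokenizer produced.
hf_ids = tokenizer.encode(string_to_test)
print(tokenizer.convert_ids_to_tokens(hf_ids))
print(tokenizer.convert_ids_to_tokens(koboldcpp_string_to_test))
# Wherever the prompt had <|im_start|> (id 32000 at training time), llama.cpp
# emitted 32001, which the model was trained to treat as <|im_end|>, and vice versa.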
This issue was closed because it has been inactive for 14 days since being marked as stale.