Attempt to fix pre-tokenizer
This marks my second attempt at resolving the issues with the pre-tokenizer in llama.cpp. I've developed a universal Unicode engine alongside a specialized regex engine. The regex engine only supports a limited subset of regex functionality, but it covers our needs and is impressively fast.
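For reference, the pattern such an engine has to reproduce is the well-known GPT-2 pre-tokenization regex. Here is a minimal Python sketch using the third-party regex module (this is the standard GPT-2 pattern, not necessarily the exact one implemented in this PR):

```python
# Sketch of GPT-2-style pre-tokenization using the third-party `regex` module,
# which supports Unicode property classes (\p{L}, \p{N}) unlike the stdlib `re`.
import regex

# The standard GPT-2 split pattern; a custom engine only needs to cover the
# constructs that appear here, which is why a limited regex engine can suffice.
GPT2_SPLIT_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def pre_tokenize(text: str) -> list[str]:
    """Split text into pre-tokens before byte-pair merging."""
    return regex.findall(GPT2_SPLIT_PATTERN, text)

print(pre_tokenize("Hello world, it's 2023!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2023', '!']
```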
I have a question regarding tokenizers: is Falcon the only model using the BPE tokenizer at this point? I'm asking because if that's not the case, we might run into some issues.
My concern stems from the diversity in pre-tokenization among models. The current code, in both the master branch and this pull request, assumes that the bpe_gpt2_preprocess function is used exclusively for the Falcon model. However, if other models also use BPE, this assumption could lead to complications.
https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
Did you take a look at the JavaScript and Python versions of transformers? I think they might be useful.
Great effort anyway
BTW, when testing the tokenizer on large-scale datasets like Wikitext, the method in the llama.cpp tests fails, primarily due to Python-side issues. Specifically, using .encode("utf-8") to convert individual token strings to bytes before writing them to a file is problematic: not all tokens represent valid UTF-8 text, so the output ends up with numerous replacement characters (�).
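For concreteness, here is a minimal sketch of that failure mode, using GPT-2 via Hugging Face transformers as a stand-in byte-level BPE tokenizer (illustrative only, not the actual llama.cpp test script):

```python
# Illustrative sketch (not the actual llama.cpp test script): decoding token IDs
# one at a time can split multi-byte UTF-8 characters across tokens, so the
# per-token strings already contain U+FFFD before .encode("utf-8") is called.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in byte-level BPE tokenizer

text = "こんにちは"
ids = tokenizer.encode(text)

pieces = [tokenizer.decode([i]) for i in ids]
print(pieces)                       # tokens holding partial UTF-8 show up as '�'

with open("tokens.txt", "wb") as f:
    for piece in pieces:
        f.write(piece.encode("utf-8"))  # faithfully writes the already-lossy strings

print(tokenizer.decode(ids))        # decoding the whole sequence restores the text
```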
UTF-8 can represent any valid Unicode, so surely this is not an issue with the use of encode - the string must already contain Unicode replacement characters because it was incorrectly decoded (str is a Unicode-aware type in Python 3).
I have a different perspective on this. If the token string already contained Unicode replacement characters, I'm curious how combining two or more such tokens could still result in a valid UTF-8 sequence. It seems counterintuitive, doesn't it? Perhaps we can clarify this with a straightforward experiment to see what actually happens.
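For example, here is such an experiment in plain Python, with no tokenizer involved, just splitting the UTF-8 bytes of one character the way byte-level BPE can split them across tokens:

```python
# Split the UTF-8 bytes of a single character the way byte-level BPE can split
# them across tokens, and observe where the replacement characters appear.
text = "こ"                       # U+3053, three bytes in UTF-8
data = text.encode("utf-8")       # b'\xe3\x81\x93'

left, right = data[:2], data[2:]  # two "tokens", each holding a partial sequence

# Decoding each part on its own is lossy: errors="replace" inserts U+FFFD.
s_left = left.decode("utf-8", errors="replace")
s_right = right.decode("utf-8", errors="replace")
assert "\ufffd" in s_left and "\ufffd" in s_right
print(s_left + s_right)           # replacement characters; 'こ' cannot be recovered

# Concatenating the *bytes* first and decoding once is lossless.
print((left + right).decode("utf-8"))   # 'こ'

# Encoding a valid str never introduces U+FFFD on its own.
assert "こんにちは".encode("utf-8").decode("utf-8") == "こんにちは"
```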
@cebtenzzre You are right: I still get ����の����ル��������3. This indeed isn't a problem with the use of .encode("utf-8"), but rather an issue that arises from using tokenizer.decode() on a single token.
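One possible workaround, sketched below, is to map each token back to its raw bytes instead of decoding it to str. This assumes the slow GPT2Tokenizer, which exposes a byte_decoder attribute (the fast tokenizer classes do not), and uses GPT-2 only for illustration:

```python
# Hypothetical sketch: recover the exact bytes of a single token without a lossy
# per-token str decode. byte_decoder maps the byte-level BPE alphabet back to
# raw byte values, so partial UTF-8 sequences survive intact.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def token_bytes(token_id: int) -> bytes:
    """Raw bytes a single byte-level BPE token stands for (may be partial UTF-8)."""
    token = tokenizer.convert_ids_to_tokens(token_id)
    return bytes(tokenizer.byte_decoder[ch] for ch in token)

ids = tokenizer.encode("こんにちは")
raw = b"".join(token_bytes(i) for i in ids)
assert raw.decode("utf-8") == "こんにちは"  # concatenated token bytes are valid UTF-8
```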
What can be done to move this forward?