[tokenizer] Inconsistent behavior in slow tokenizer and fast tokenizer
System Info
- transformers version: 4.35.2
- Platform: Linux-5.4.0-163-generic-x86_64-with-glibc2.10
- Python version: 3.8.18
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no need
- Using distributed or parallel set-up in script?: no need
Who can help?
@ArthurZucker and @younesbelkada
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer
def answer_or_exception(tokenizer, id):
    # Print the decoded string for a single id, or the exception message if decoding fails.
    print(f'<<<<<<{tokenizer.__class__}>>>>>>')
    try:
        print(f'"{tokenizer.decode([id])}"')
    except Exception as e:
        print(e)
tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=False)
# vocab size: 50294
answer_or_exception(tokenizer, 50294) # correct
answer_or_exception(tokenizer, 50295) # wrong
tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/phi-2", trust_remote_code=True, use_fast=True)
# vocab size: 50294
answer_or_exception(tokenizer, 50294) # correct
answer_or_exception(tokenizer, 50295) # correct
tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=False)
# vocab size: 31999
answer_or_exception(tokenizer, 31999) # correct
answer_or_exception(tokenizer, 32000) # wrong
tokenizer = AutoTokenizer.from_pretrained("/mnt/data01/shichao/models/Llama-2-7b-chat-hf", trust_remote_code=True, use_fast=True)
# vocab size: 31999
answer_or_exception(tokenizer, 31999) # correct
answer_or_exception(tokenizer, 32000) # correct
Output:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
" "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen.CodeGenTokenizer'>>>>>>>
sequence item 0: expected str instance, NoneType found
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
" "
<<<<<<<class 'transformers.models.codegen.tokenization_codegen_fast.CodeGenTokenizerFast'>>>>>>>
""
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
"ç»™"
<<<<<<<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>>>>>>>
piece id is out of range.
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
"ç»™"
<<<<<<<class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>>>>>>>
""
Expected behavior
Consistent decode behavior between the slow and fast tokenizers when the id exceeds the vocab size. For example, instead of raising an exception, the slow tokenizer could output an empty string like the fast tokenizer does.
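In the meantime, a possible user-side workaround (a minimal sketch; safe_decode is a hypothetical helper, not part of transformers):

def safe_decode(tokenizer, ids):
    # Drop ids outside the tokenizer's known range so the slow tokenizer
    # behaves like the fast one (empty output) instead of raising.
    vocab_size = len(tokenizer)  # includes added special tokens
    return tokenizer.decode([i for i in ids if 0 <= i < vocab_size])

print(safe_decode(tokenizer, [31999, 32000]))  # decodes 31999, silently skips 32000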
Hey! Thanks for opening an issue. A few things first: you are using a custom / local checkpoint with trust_remote_code.
The fast tokenizer does not error out when you feed it an out-of-vocabulary id, while the slow one does, which is indeed inconsistent. Would you like to open a PR for a fix? 🤗
Yes, I'll try that. Thank you for your reply!
@ArthurZucker @Ki-Seki can I work on it if it's not fixed yet?
I'm OK with that. I've been busy with other things recently.
Sure 🤗
@ArthurZucker @Ki-Seki Has it been fixed yet? I want to start working on it.
The PR is closed, so you can probably work on it!
@ArthurZucker What should the behavior be when both tokenizer types encounter an OOV token? Should it simply raise an index error exception, or do you have something else in mind?
Not entirely sure, but imo let's align with what the fast tokenizer does!
@ArthurZucker We can start by raising a warning about this new change. After a certain version release, we can introduce a custom exception for OOV ids in the tokenizer.
Yeah that sounds reasonable
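A rough sketch of that staged approach (illustrative only; _check_decode_ids is a hypothetical helper, not the actual transformers internals):

import warnings

def _check_decode_ids(ids, vocab_size):
    # Warn about OOV ids for now and filter them out; a later release could
    # turn this warning into a dedicated exception.
    oov = [i for i in ids if i < 0 or i >= vocab_size]
    if oov:
        warnings.warn(
            f"Token ids {oov} are out of the vocabulary and will be ignored; "
            "a future release will raise an error instead.",
            FutureWarning,
        )
    return [i for i in ids if 0 <= i < vocab_size]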
@ArthurZucker
My plan is to start raising a warning in the fast tokenizer implementation so that users know when out-of-vocabulary (OOV) ids are being ignored. In the slow tokenizer, OOV ids currently trigger different errors by default.
The main challenge is that most model-specific tokenizers don't go through the base decode method in tokenization_utils; instead, they override it in their own tokenizer classes.
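To illustrate the overriding problem (the class below is purely illustrative, not taken from the repository):

from transformers import PreTrainedTokenizer

class MyCustomTokenizer(PreTrainedTokenizer):
    # A tokenizer that overrides decode directly. A warning added to the base
    # decode in tokenization_utils would never run here, so the OOV check may
    # need to live in a lower-level method such as convert_ids_to_tokens.
    def decode(self, token_ids, **kwargs):
        ...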
Could you please review the pull request (#32912) and share your thoughts?